Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 5:

(7 comments)

From PS5 it is no longer WIP. It should be functionally complete and tested.

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@399
PS4, Line 399:     DCHECK_EQ(page_encoding_, Encoding::PLAIN);
> Please add  DCHECK_EQ(page_encoding_, parquet::Encoding::PLAIN);
Done


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@1088
PS4, Line 1088:
> This can be hit in a corrupt Parquet file.
Removed DCHECK.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@229
PS4, Line 229: 4_t Encode
> Can you add a comment for the function? It may make sense to mention "Skip"
Added a short comment. I like the current function name, because it really just 
returns the encoded length.

Currently it is only used for value-skipping, but it might turn out to be 
useful for other stuff in the future.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@230
PS4, Line 230: int32_t
> This topic was touched in https://gerrit.cloudera.org/#/c/12172/ , but I wo
Done


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@241
PS4, Line 241:         DCHECK(false);
> It would be better to return -1 than 0 in this case.
Done


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@244
PS4, Line 244:     int64_t encoded_len = byte_size * num_values;
> The specialization for BYTE_ARRAY checks if the result will not pass buffer
Added check for the buffer end.
Changed result type to int64_t.
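
For readers following the thread, the idea behind the last three replies (return 
-1 on error, check against the buffer end, use a 64-bit result) can be sketched 
roughly as below. This is only an illustration, not the code in the patch: 
FixedEncodedLen and its parameter list are hypothetical names, and the sketch 
assumes both operands are 32-bit, which is why it widens before multiplying.

```cpp
#include <cstdint>

// Hypothetical helper, not the Impala function under review. Returns the
// number of bytes occupied by 'num_values' fixed-width values of 'byte_size'
// bytes each, or -1 if the arguments are invalid or the values would extend
// past 'buffer_end'.
inline int64_t FixedEncodedLen(const uint8_t* buffer, const uint8_t* buffer_end,
    int32_t byte_size, int32_t num_values) {
  if (buffer == nullptr || buffer_end < buffer) return -1;
  if (byte_size <= 0 || num_values < 0) return -1;
  // Widen one operand first so the multiplication is done in 64 bits and
  // cannot overflow for any pair of 32-bit inputs.
  int64_t encoded_len = static_cast<int64_t>(byte_size) * num_values;
  int64_t available = buffer_end - buffer;
  // A corrupt file can claim more values than the page holds; report an
  // error instead of letting the caller skip past the end of the buffer.
  if (encoded_len > available) return -1;
  return encoded_len;
}
```

Returning -1 rather than 0 lets the caller tell an error apart from a legitimate 
skip of zero values.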


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-page-index.cc
File be/src/exec/parquet/parquet-page-index.cc:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-page-index.cc@107
PS4, Line 107:   }
             :   DCHECK_GE(file_offset, base_offset_);
> file_offset and length come from thrift structures and are not validated at
In ReadAll() we iterate through the whole page index when we calculate 
'base_offset_' and 'max_offset_'.

Here we shouldn't see any values besides the ones we have already seen in 
ReadAll(). If we do, then there is a logic error in the code.
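
The argument above can be sketched as a two-step pattern: untrusted values are 
checked once while the bounds are computed, and later lookups only assert the 
invariant. This is a simplified illustration, not the patch itself: IndexRange, 
IndexBuffer and RelativeOffset are hypothetical names, plain assert stands in 
for DCHECK, and the validation inside ReadAll() here is an assumption about 
what such a first pass could check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for an (offset, length) pair read from file metadata.
struct IndexRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical class; member names only loosely mirror those in the review.
class IndexBuffer {
 public:
  // Scans every range once and records the smallest start and largest end.
  // Returns false for an empty list or for values that cannot be valid
  // (negative offset or length, or offset + length overflowing int64_t).
  bool ReadAll(const std::vector<IndexRange>& ranges) {
    if (ranges.empty()) return false;
    int64_t base = std::numeric_limits<int64_t>::max();
    int64_t max = 0;
    for (const IndexRange& r : ranges) {
      if (r.offset < 0 || r.length < 0) return false;
      if (r.length > std::numeric_limits<int64_t>::max() - r.offset) return false;
      base = std::min(base, r.offset);
      max = std::max(max, r.offset + r.length);
    }
    base_offset_ = base;
    max_offset_ = max;
    return true;
  }

  // Position of 'r' relative to the start of the scanned region. Every range
  // passed here was already seen by ReadAll(), so a violation indicates a
  // logic error rather than a corrupt file.
  int64_t RelativeOffset(const IndexRange& r) const {
    assert(r.offset >= base_offset_);
    assert(r.offset + r.length <= max_offset_);
    return r.offset - base_offset_;
  }

  int64_t base_offset() const { return base_offset_; }
  int64_t max_offset() const { return max_offset_; }

 private:
  int64_t base_offset_ = 0;
  int64_t max_offset_ = 0;
};
```

The assertions stay valid only as long as lookups use exactly the ranges that 
ReadAll() scanned; any other source of offsets would need its own runtime check.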



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Feb 2019 16:25:53 +0000
Gerrit-HasComments: Yes
