Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 5:

(7 comments)

From PS5 on, this is no longer WIP. It should be functionally complete and tested.

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@399
PS4, Line 399: DCHECK_EQ(page_encoding_, Encoding::PLAIN);
> Please add DCHECK_EQ(page_encoding_, parquet::Encoding::PLAIN);
Done


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@1088
PS4, Line 1088:
> This can be hit in a corrupt Parquet file.
Removed DCHECK.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@229
PS4, Line 229: 4_t Encode
> Can you add a comment for the function? It may make sense to mention "Skip"
Added a short comment. I like the current function name because it really does just return the encoded length. Currently it is only used for value skipping, but it might turn out to be useful for other things in the future.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@230
PS4, Line 230: int32_t
> This topic was touched in https://gerrit.cloudera.org/#/c/12172/ , but I wo
Done


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@241
PS4, Line 241: DCHECK(false);
> It would be better to return -1 than 0 in this case.
Done


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@244
PS4, Line 244: int64_t encoded_len = byte_size * num_values;
> The specialization for BYTE_ARRAY checks if the result will not pass buffer
Added a check for the buffer end. Changed the result type to int64_t.
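
For readers following along, here is a minimal sketch of the shape the last two replies describe: an encoded-length computation for fixed-size values that returns int64_t, reports -1 on error, and refuses to go past the end of the buffer. The name FixedLenEncodedSize and its parameter list are hypothetical and chosen only for illustration. This is not the function from the patch.

```cpp
#include <cstdint>

// Hypothetical helper, not the function from the patch. Returns the number of
// bytes occupied by 'num_values' fixed-size values of 'byte_size' bytes each,
// or -1 if the inputs are invalid or the values would run past 'buffer_end'.
inline int64_t FixedLenEncodedSize(const uint8_t* buffer, const uint8_t* buffer_end,
    int64_t num_values, int64_t byte_size) {
  if (buffer == nullptr || buffer_end < buffer) return -1;
  if (num_values < 0 || byte_size <= 0) return -1;
  const int64_t available = buffer_end - buffer;
  // Compare by division so the multiplication below cannot overflow.
  if (num_values > available / byte_size) return -1;
  return byte_size * num_values;
}
```

Returning -1 rather than 0 lets a caller tell "nothing to skip" apart from "could not compute the length", which matters when the input may come from a corrupt file.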


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-page-index.cc
File be/src/exec/parquet/parquet-page-index.cc:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-page-index.cc@107
PS4, Line 107:   }
             :   DCHECK_GE(file_offset, base_offset_);
> file_offset and length come from thrift structures and are not validated at
In ReadAll() we iterate through the whole page index when we calculate 'base_offset_' and 'max_offset_'. Here we shouldn't see any values besides the ones we have already seen in ReadAll(). If we do, then there is a logic error in the code.



-- 
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Feb 2019 16:25:53 +0000
Gerrit-HasComments: Yes