Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/19793 )
Change subject: IMPALA-6433: Add read support for PageHeaderV2 ...................................................................... Patch Set 10: (27 comments) http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG@41 PS8, Line 41: ti > Nit: should be a single 't'. Done http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG@60 PS8, Line 60: ti > Nit: single 't'. Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h File be/src/exec/parquet/parquet-column-chunk-reader.h: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@123 PS8, Line 123: non-empty data page > Nit: could be "non-empty data page's page header"? Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@123 PS8, Line 123: h the > Nit: "the client" would be better. Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@125 PS8, Line 125: other types of pages > Could mention that we also skip empty data pages also here. Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@170 PS8, Line 170: > Nit: maybe 'has_{rep/def}_levels_' (plural) would sound better. I prefer the singular form in this case, not sure why :) http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc File be/src/exec/parquet/parquet-column-chunk-reader.cc: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@211 PS8, Line 211: skipping page types we don't care about > And also empty pages. Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@260 PS8, Line 260: > Can this be NULL if the header is not v2 or if 'is_compressed' is set? Isn' It is possible to encounter files like this and it is not clearly an error (is_compressed is optional and true by default) http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@275 PS8, Line 275: RETURN_IF_ERROR(ParquetLevelDecoder::ParseRleByteSize( > Can we extract this block to a function? Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@284 PS8, Line 284: > It may be a bit misleading to blame 'def_level_size' if for example the 're Changed the error message. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@314 PS8, Line 314: Type::DATA_PAGE_V > In my opinion it would be cleaner if we didn't reuse 'uncompressed_size' bu I changed it but I am not sure that it is better like this - it is used as both input and output argument in ProcessBlock32 so it must be set to the expected size (used as maximum size) and will be overwritten by the actual size. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@367 PS8, Line 367: data = decompressed_buffer; > Isn't this the other way round? If v2 page, rep/def levels have already bee Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@369 PS8, Line 369: if (has_slot_desc) { > Can we extract this to a function? Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@399 PS8, Line 399: if (has_slot_desc) { : // Use original sizes (includes levels in v2) in th > We have already filled these fields on L292-293 and L370-371. Done http://gerrit.cloudera.org:8080/#/c/19793/9/be/src/exec/parquet/parquet-column-chunk-reader.cc File be/src/exec/parquet/parquet-column-chunk-reader.cc: http://gerrit.cloudera.org:8080/#/c/19793/9/be/src/exec/parquet/parquet-column-chunk-reader.cc@286 PS9, Line 286: Status ParquetColumnChunkReader::ProcessRepDefLevelsInDataPageV2( > line too long (108 > 90) Done http://gerrit.cloudera.org:8080/#/c/19793/9/be/src/exec/parquet/parquet-column-chunk-reader.cc@333 PS9, Line 333: if (is_v2) { > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc@359 PS8, Line 359: PLAIN_DICTIONARY > Why not RLE_DICTIONARY? Kept it as PLAIN_DICTIONARY to avoid touching unrelated parts of the code. I think that it would be better to clean this up when Impala will actually switch to writing V2 pages by default, as currently we mainly see PLAIN_DICTIONARY encodings. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc@1145 PS8, Line 1145: Status BaseScalarColumnReader::ReadDataPage() { > Can't we reuse ReadNextDataPageHeader() and possibly also ReadCurrentDataPa I thought about this too but wouldn't to it in this patch. The difference of these functions comes for skipping vs normal reading context and can have some subtle differences. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h File be/src/exec/parquet/parquet-level-decoder.h: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@56 PS8, Line 56: > Should be int32_t, see comment for L59. done - int32_t vs int is used a bit chaotically in Impala and I think that lot of things would break if int was not 32 bit http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@58 PS8, Line 58: static Status ParseRleByteSize(const string& filename, > Could add a short description. Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@59 PS8, Line 59: int > We should use a fixed length type, for example int32_t as the thrift type i Done http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.cc File be/src/exec/parquet/parquet-level-decoder.cc: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.cc@68 PS8, Line 68: Init > Before the change Init() used to do validation but now it doesn't, the call Removed 'encoding' and added DCHECKs for it in the caller code. It seems very unlikely that we will ever support another level encoding. http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/datasets/functional/functional_schema_template.sql@4099 PS8, Line 4099: alltypesagg_parquet_v2 > Why do we have this table twice? Probably was a mistake. http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test File testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test: http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test@44 PS8, Line 44: ==== > Nit: Could be moved to the test of def levels on L12. Moved it before the error messages. It checks both value and rep level decoding so I wouldn't put it before line 33 http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py@168 PS8, Line 168: ueries = [ > Could use the variable 'table_name'. Done http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py@174 PS8, Line 174: > Could use the variable 'table_name'. Done http://gerrit.cloudera.org:8080/#/c/19793/9/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/19793/9/tests/query_test/test_scanners_fuzz.py@170 PS9, Line 170: > flake8: E221 multiple spaces before operator Done -- To view, visit http://gerrit.cloudera.org:8080/19793 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca Gerrit-Change-Number: 19793 Gerrit-PatchSet: 10 Gerrit-Owner: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Comment-Date: Wed, 03 May 2023 18:42:55 +0000 Gerrit-HasComments: Yes
