Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/24063 )
Change subject: IMPALA-14794: Implement small string optimization in Parquet scanner
......................................................................


Patch Set 7:

(7 comments)

Thanks for working on this!

http://gerrit.cloudera.org:8080/#/c/24063/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/24063/7//COMMIT_MSG@20
PS7, Line 20: Measured time of a join query where lots of small strings are sent from
            : the reader:
            :   select l1.* from tpch_parquet.lineitem l1
            :   join tpch_parquet.lineitem l2 on l1.l_shipmode = l2.l_shipmode
            :   limit 1;
Can you also run one of the following jobs?

* https://jenkins.impala.io/job/perf-AB-test-ub2204/
* https://jenkins.impala.io/job/perf-AB-test-ub2004/

Results for both TPC-H and TPC-DS would be interesting.


http://gerrit.cloudera.org:8080/#/c/24063/7//COMMIT_MSG@26
PS7, Line 26: Before:
            :   KrpcDataStreamSender: SerializeBatchTime: ~90.0ms
            : After:
            :   KrpcDataStreamSender: SerializeBatchTime: ~65.0ms
Can you include other metrics, e.g. scan times (such as MaterializeTupleTime)?

Also, for more stable measurements, you could use a higher scale factor. You could load it via

  bin/load-data.py -s 30 -f --workloads tpch --table_formats text/none,parquet/snap


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-chunk-reader.h@176
PS7, Line 176: by the row batch.
nit: by the last row batch.


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-chunk-reader.cc
File be/src/exec/parquet/parquet-column-chunk-reader.cc:

http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-chunk-reader.cc@173
PS7, Line 173: the row batch.
nit: the last row batch.

http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-readers.cc@837
PS7, Line 837:         all_strings_smallified = false;
We could add

  DCHECK_FALSE(val.Smallify());

to verify we didn't miss the opportunity to smallify the string.


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-readers.cc@870
PS7, Line 870:       col_chunk_reader_.keep_data_page_pool_ = true;
We could add

  DCHECK_FALSE(val.Smallify());


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-readers.cc@963
PS7, Line 963:     if (!val->IsSmall()) {
We could add

  DCHECK_FALSE(val.Smallify());



-- 
To view, visit http://gerrit.cloudera.org:8080/24063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I16c550d35cd6d3ec259b899b325611294137ccef
Gerrit-Change-Number: 24063
Gerrit-PatchSet: 7
Gerrit-Owner: Balazs Hevele <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 10 Mar 2026 17:40:15 +0000
Gerrit-HasComments: Yes
