Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/24063 )

Change subject: IMPALA-14794: Implement small string optimization in Parquet scanner
......................................................................


Patch Set 7:

(7 comments)

Thanks for working on this!

http://gerrit.cloudera.org:8080/#/c/24063/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/24063/7//COMMIT_MSG@20
PS7, Line 20: Measured time of a join query where lots of small strings are sent from
            : the reader:
            :   select l1.* from tpch_parquet.lineitem l1
            :     join tpch_parquet.lineitem l2 on l1.l_shipmode = l2.l_shipmode
            :     limit 1;
Can you also run one of the following jobs?

* https://jenkins.impala.io/job/perf-AB-test-ub2204/
* https://jenkins.impala.io/job/perf-AB-test-ub2004/

Results for both TPC-H and TPC-DS would be interesting.


http://gerrit.cloudera.org:8080/#/c/24063/7//COMMIT_MSG@26
PS7, Line 26: Before:
            :   KrpcDataStreamSender: SerializeBatchTime: ~90.0ms
            : After:
            :   KrpcDataStreamSender: SerializeBatchTime: ~65.0ms
Can you include other metrics, e.g. scan times (such as MaterializeTupleTime)?
Also, for better stability, you could use a higher scale factor.

You could load it via

 bin/load-data.py -s 30 -f --workloads tpch --table_formats text/none,parquet/snap


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-chunk-reader.h@176
PS7, Line 176: by the row batch.
nit: by the last row batch.


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-chunk-reader.cc
File be/src/exec/parquet/parquet-column-chunk-reader.cc:

http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-chunk-reader.cc@173
PS7, Line 173: the row batch.
nit: the last row batch.


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-readers.cc@837
PS7, Line 837:       all_strings_smallified = false;
We could add DCHECK_FALSE(val.Smallify()); here to verify that we didn't
miss an opportunity to smallify the string.
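
A rough sketch of the idea, not the actual patch. SketchStringValue,
kSmallLimit, DecodeSketch and SKETCH_DCHECK_FALSE are hypothetical stand-ins
invented for illustration; the real StringValue layout, its small-string
limit, and the DCHECK macros may well differ. The sketch assumes that
Smallify() returns true when the value is (or becomes) small and false when
it is too long to fit inline.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for a debug-only check; compiled out when NDEBUG is defined.
#define SKETCH_DCHECK_FALSE(cond) assert(!(cond))

// Hypothetical, simplified model of a string value with an inline buffer.
class SketchStringValue {
 public:
  // Assumed inline capacity, chosen for illustration only.
  static constexpr uint32_t kSmallLimit = 11;

  SketchStringValue(const char* ptr, uint32_t len) : ptr_(ptr), len_(len) {}

  bool IsSmall() const { return is_small_; }

  // Copies the bytes into the inline buffer if they fit.
  // Returns true if the value is small afterwards, false otherwise.
  bool Smallify() {
    if (is_small_) return true;
    if (len_ > kSmallLimit) return false;
    std::memcpy(small_buf_, ptr_, len_);
    ptr_ = nullptr;
    is_small_ = true;
    return true;
  }

 private:
  const char* ptr_;
  uint32_t len_;
  char small_buf_[kSmallLimit] = {};
  bool is_small_ = false;
};

// Models a decoder that smallifies every string that fits.
inline SketchStringValue DecodeSketch(const char* s) {
  SketchStringValue val(s, static_cast<uint32_t>(std::strlen(s)));
  val.Smallify();  // Long strings simply stay in pointer form.
  return val;
}

// Models the reviewed branch: a value that is still not small here must be
// too long, so a second Smallify() attempt is expected to fail.
inline bool AllStringsSmallified(std::vector<SketchStringValue>& vals) {
  bool all_strings_smallified = true;
  for (SketchStringValue& val : vals) {
    if (!val.IsSmall()) {
      SKETCH_DCHECK_FALSE(val.Smallify());
      all_strings_smallified = false;
    }
  }
  return all_strings_smallified;
}
```

If the decode path ever skipped a string that would have fit, the check
would fire in debug builds, while release builds would pay nothing for it.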


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-readers.cc@870
PS7, Line 870:     col_chunk_reader_.keep_data_page_pool_ = true;
We could add DCHECK_FALSE(val.Smallify());


http://gerrit.cloudera.org:8080/#/c/24063/7/be/src/exec/parquet/parquet-column-readers.cc@963
PS7, Line 963:       if (!val->IsSmall()) {
We could add DCHECK_FALSE(val.Smallify());



--
To view, visit http://gerrit.cloudera.org:8080/24063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I16c550d35cd6d3ec259b899b325611294137ccef
Gerrit-Change-Number: 24063
Gerrit-PatchSet: 7
Gerrit-Owner: Balazs Hevele <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 10 Mar 2026 17:40:15 +0000
Gerrit-HasComments: Yes
