Hello Zoltan Borok-Nagy, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/24063
to look at the new patch set (#8).
Change subject: IMPALA-14794: Implement small string optimization in Parquet
scanner
......................................................................
IMPALA-14794: Implement small string optimization in Parquet scanner
When the scanner decodes a string from a Parquet file, it now always
smallifies the string if possible.
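
As a rough illustration of what "smallifying" means here, the sketch below
copies a short string into the value itself so that it no longer points at
external memory. SketchString, its fields, and the 11-byte inline capacity
are assumptions made for this example; they are not the layout or API of
smallable-string.h.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for a smallable string. This is NOT Impala's
// actual SmallableString layout; it only illustrates the idea.
struct SketchString {
  // Assumed inline capacity for this sketch.
  static constexpr uint32_t kInlineCapacity = 11;

  bool is_small = false;
  uint32_t len = 0;
  const char* ptr = nullptr;             // used only when !is_small
  char inline_buf[kInlineCapacity] = {}; // used only when is_small

  // Points the value at external memory without copying it.
  void Assign(const char* data, uint32_t n) {
    is_small = false;
    ptr = data;
    len = n;
  }

  // Copies the bytes inline if they fit. Returns true if the value is
  // small afterwards, false if it still points at external memory.
  bool Smallify() {
    if (is_small) return true;
    if (len > kInlineCapacity) return false;
    if (len > 0) std::memcpy(inline_buf, ptr, len);
    ptr = nullptr;
    is_small = true;
    return true;
  }

  std::string ToString() const {
    if (len == 0) return std::string();
    return std::string(is_small ? inline_buf : ptr, len);
  }
};
```

In this sketch a value such as "MAIL" no longer references its source buffer
after Smallify(), while a value longer than the assumed capacity keeps its
pointer.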
With plain encoding, if a column's data page only contains smallified
strings, the data page's memory is no longer attached to the tuple,
since no string points into it.
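
The plain-encoding decision can be sketched as a scan over the page: Parquet
PLAIN encoding stores each BYTE_ARRAY value as a 4-byte little-endian length
followed by the bytes, so a page only has to stay alive if at least one value
is too long to be stored inline. The helper names and the inline capacity
passed in are hypothetical and chosen for this example only; the actual
change tracks this while decoding rather than in a separate pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: appends one PLAIN-encoded BYTE_ARRAY value
// (4-byte little-endian length, then the bytes) to a page buffer.
void AppendPlain(std::vector<uint8_t>* page, const std::string& value) {
  uint32_t len = static_cast<uint32_t>(value.size());
  for (int shift = 0; shift < 32; shift += 8) {
    page->push_back(static_cast<uint8_t>((len >> shift) & 0xFF));
  }
  page->insert(page->end(), value.begin(), value.end());
}

// Hypothetical helper: returns true if any value in the page is longer
// than inline_capacity, i.e. some decoded string would still point into
// the page and the page memory would have to be attached.
bool PageNeedsAttaching(const std::vector<uint8_t>& page,
                        uint32_t inline_capacity) {
  size_t pos = 0;
  while (pos + 4 <= page.size()) {
    uint32_t len = static_cast<uint32_t>(page[pos]) |
                   (static_cast<uint32_t>(page[pos + 1]) << 8) |
                   (static_cast<uint32_t>(page[pos + 2]) << 16) |
                   (static_cast<uint32_t>(page[pos + 3]) << 24);
    pos += 4;
    if (len > page.size() - pos) break;  // truncated value: stop here
    if (len > inline_capacity) return true;
    pos += len;
  }
  return false;
}
```

With an assumed capacity of 11, a page holding only values like "MAIL" and
"SHIP" would not need to be attached, while a page with a single longer value
would.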
With dictionary encoding, string column readers always allocate a copy
of the dictionary page so strings can point into it. If all strings are
smallified, this copy is freed after decoding all data, because no
strings point into it.
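
A minimal sketch of the dictionary-copy lifecycle described above, assuming a
reader that remembers whether any decoded value was too long to be stored
inline. DictCopySketch and its members are illustrative names only and do not
correspond to classes in parquet-column-readers.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bookkeeping for a reader-owned dictionary copy.
class DictCopySketch {
 public:
  DictCopySketch(std::vector<std::string> entries, uint32_t inline_capacity)
    : copy_(new std::vector<std::string>(std::move(entries))),
      inline_capacity_(inline_capacity) {}

  // Decodes one value by dictionary index. Returns true if the value fits
  // inline; otherwise notes that a string now points into the copy.
  bool Decode(size_t index) {
    bool fits = copy_->at(index).size() <= inline_capacity_;
    if (!fits) has_pointer_into_copy_ = true;
    return fits;
  }

  // Called once all data has been decoded. Frees the copy if no decoded
  // string points into it. Returns true if the copy was freed.
  bool FinishAndMaybeFree() {
    if (!has_pointer_into_copy_) copy_.reset();
    return copy_ == nullptr;
  }

 private:
  std::unique_ptr<std::vector<std::string>> copy_;
  uint32_t inline_capacity_;
  bool has_pointer_into_copy_ = false;
};
```

The copy is released only at the end because, in this sketch, a later value
could still turn out to be too long for inline storage.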
Measurements:
Measured the runtime profile of a join query in which the scanner sends
many small strings:
select l1.* from tpch_parquet.lineitem l1
join tpch_parquet.lineitem l2 on l1.l_shipmode = l2.l_shipmode
limit 1;
Before:
KrpcDataStreamSender: SerializeBatchTime: 84.385ms
HDFS_SCAN_NODE: MaterializeTupleTime: 8.183ms
After:
KrpcDataStreamSender: SerializeBatchTime: 67.598ms
HDFS_SCAN_NODE: MaterializeTupleTime: 8.632ms
Same measurement with a table with a higher scale factor:
select l1.* from tpch30_parquet_snap.lineitem l1
join tpch30_parquet_snap.lineitem l2 on l1.l_shipmode = l2.l_shipmode
limit 1;
Before:
KrpcDataStreamSender: SerializeBatchTime: 2s359ms
HDFS_SCAN_NODE: MaterializeTupleTime: 239.267ms
After:
KrpcDataStreamSender: SerializeBatchTime: 1s702ms
HDFS_SCAN_NODE: MaterializeTupleTime: 243.606ms
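SerializeBatchTime drops by ~20% on the first table and by ~28% on the
higher scale factor one, while MaterializeTupleTime rises slightly (~5%
and ~2%, respectively).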
Testing:
- Added a test to parquet-plain-test.cc verifying that small strings are
  smallified when decoded
Change-Id: I16c550d35cd6d3ec259b899b325611294137ccef
---
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-plain-test.cc
M be/src/runtime/smallable-string.h
M be/src/runtime/string-value.h
M be/src/testutil/random-vector-generators.h
8 files changed, 178 insertions(+), 6 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/24063/8
--
To view, visit http://gerrit.cloudera.org:8080/24063
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I16c550d35cd6d3ec259b899b325611294137ccef
Gerrit-Change-Number: 24063
Gerrit-PatchSet: 8
Gerrit-Owner: Balazs Hevele <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>