Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/24063

to look at the new patch set (#6).

Change subject: WIP IMPALA-14794: Implement small string optimization in Parquet scanner
......................................................................

WIP IMPALA-14794: Implement small string optimization in Parquet scanner

When decoding a string from a Parquet file, the scanner now always
smallifies it if possible.
With plain encoding, if a column's data page only contains smallified
strings, the data page's memory is no longer attached to the tuple,
since no string points into it.
With dictionary encoding, string column readers always allocate a copy
of the dictionary page so strings can point into it. If all strings are
smallified, this copy is freed after decoding all data, because no
strings point into it.
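
For reviewers, a minimal sketch of the plain-encoding case described
above. It is not the code in this change: SketchString, DecodePlainPage,
AppendPlainValue and the 11-byte inline capacity are hypothetical names
and values chosen for illustration, and the real StringValue layout may
differ. It only shows the idea of tracking whether any decoded string
still points into the page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a string slot. NOT Impala's StringValue layout.
struct SketchString {
  static constexpr uint32_t kInlineCapacity = 11;  // assumed threshold
  uint32_t len = 0;
  bool is_small = false;
  char inline_buf[kInlineCapacity] = {};  // holds the bytes when is_small
  const char* ptr = nullptr;              // points into page memory otherwise

  // Copies the bytes inline if they fit, else keeps a pointer to them.
  // Returns true if the string was smallified.
  bool Assign(const char* data, uint32_t n) {
    len = n;
    is_small = n <= kInlineCapacity;
    if (is_small) {
      std::memcpy(inline_buf, data, n);
      ptr = nullptr;
    } else {
      ptr = data;
    }
    return is_small;
  }

  std::string ToString() const {
    return std::string(is_small ? inline_buf : ptr, len);
  }
};

// Test helper: appends one value in Parquet PLAIN BYTE_ARRAY form
// (4-byte length prefix, then the bytes). Uses host byte order, so this
// sketch assumes a little-endian host.
void AppendPlainValue(std::vector<char>* page, const std::string& s) {
  uint32_t n = static_cast<uint32_t>(s.size());
  const char* len_bytes = reinterpret_cast<const char*>(&n);
  page->insert(page->end(), len_bytes, len_bytes + sizeof(n));
  page->insert(page->end(), s.begin(), s.end());
}

// Decodes every value in the page into 'out'. Returns true if at least one
// decoded string points into 'page', i.e. the page memory must stay alive.
bool DecodePlainPage(const std::vector<char>& page,
                     std::vector<SketchString>* out) {
  bool needs_page_memory = false;
  std::size_t pos = 0;
  while (pos + sizeof(uint32_t) <= page.size()) {
    uint32_t n;
    std::memcpy(&n, page.data() + pos, sizeof(n));
    pos += sizeof(n);
    assert(pos + n <= page.size());  // the sketch does not handle truncated pages
    SketchString s;
    if (!s.Assign(page.data() + pos, n)) needs_page_memory = true;
    out->push_back(s);
    pos += n;
  }
  return needs_page_memory;
}
```

If DecodePlainPage returns false, a caller could skip attaching the page
buffer to the output. The same kind of flag could in principle decide
whether the dictionary page copy can be freed once all data is decoded.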

Measurements:
Measured the time of a join query in which many small strings are sent
from the reader:
  select l1.* from tpch_parquet.lineitem l1
    join tpch_parquet.lineitem l2 on l1.l_shipmode = l2.l_shipmode
    limit 1;

Before:
  KrpcDataStreamSender: SerializeBatchTime: ~90.0ms
After:
  KrpcDataStreamSender: SerializeBatchTime: ~65.0ms

This is a ~28% reduction in SerializeBatchTime
((90.0 - 65.0) / 90.0 ≈ 0.28). Most of the gain comes from deep copy.

TODO:
-test?
  -test that all cases work properly (plain/dict encoding,
   small/large strings)
  -test that page data is not actually attached to the tuple when
   there are only small strings?

Change-Id: I16c550d35cd6d3ec259b899b325611294137ccef
---
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-common.h
M be/src/runtime/string-value.h
5 files changed, 127 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/24063/6
--
To view, visit http://gerrit.cloudera.org:8080/24063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I16c550d35cd6d3ec259b899b325611294137ccef
Gerrit-Change-Number: 24063
Gerrit-PatchSet: 6
Gerrit-Owner: Balazs Hevele <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
