[ https://issues.apache.org/jira/browse/IMPALA-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18067185#comment-18067185 ]
ASF subversion and git services commented on IMPALA-14794:
----------------------------------------------------------
Commit 49f0ab1d09541f960cee12e9a5e6aa38ec21565a in impala's branch
refs/heads/master from Balazs Hevele
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=49f0ab1d0 ]
IMPALA-14794: Implement small string optimization in Parquet scanner
When decoding a string from a Parquet file, the scanner now always
smallifies it if possible, i.e. stores it inline in the string slot when
it is short enough.
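To make "smallified" concrete, here is a minimal sketch of a small string optimization. It is not Impala's actual StringValue layout: the SketchString type, its explicit is_small flag, and the 11-byte inline capacity are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical, simplified string slot. Not Impala's real StringValue.
struct SketchString {
  // Assumed inline capacity for this sketch only.
  static constexpr uint32_t kInlineCapacity = 11;

  bool is_small = false;
  uint32_t len = 0;
  const char* ptr = nullptr;              // meaningful only when !is_small
  char inline_buf[kInlineCapacity] = {};  // meaningful only when is_small

  // Points the slot at external memory without copying any bytes.
  void Assign(const char* data, uint32_t length) {
    is_small = false;
    ptr = data;
    len = length;
  }

  // Copies the bytes into the slot if they fit. Returns true if the slot
  // no longer references external memory afterwards.
  bool Smallify() {
    if (is_small) return true;
    if (len > kInlineCapacity) return false;
    if (len > 0) std::memcpy(inline_buf, ptr, len);
    ptr = nullptr;
    is_small = true;
    return true;
  }

  const char* data() const { return is_small ? inline_buf : ptr; }
  std::string ToString() const { return std::string(data(), len); }
};
```

Once Smallify() succeeds, the slot is self-contained, so the buffer it was decoded from can be released or reused without invalidating the string.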
With plain encoding, if a column's data page only contains smallified
strings, the data page's memory is no longer attached to the tuple,
since no string points into it.
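The plain-encoding case can be sketched as follows. Parquet PLAIN encoding stores each BYTE_ARRAY value as a 4-byte little-endian length followed by the bytes; everything else here (the Slot type, DecodePlainPage, the page_referenced flag, the 11-byte limit) is a hypothetical stand-in and not Impala's code. The point is only that the decoder can tell, while decoding, whether any string still points into the page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

constexpr uint32_t kInlineCapacity = 11;  // assumed limit for this sketch

// Hypothetical string slot: either inline bytes or a pointer into the page.
struct Slot {
  bool is_small = false;
  uint32_t len = 0;
  const char* ptr = nullptr;
  char inline_buf[kInlineCapacity] = {};
  std::string ToString() const {
    return std::string(is_small ? inline_buf : ptr, len);
  }
};

struct DecodeResult {
  std::vector<Slot> slots;
  bool page_referenced = false;  // true if any slot points into the page
  bool ok = true;                // false if the page was truncated
};

// Helper for building a PLAIN-encoded page: 4-byte LE length, then bytes.
void AppendPlainString(std::string* page, const std::string& value) {
  const uint32_t n = static_cast<uint32_t>(value.size());
  for (int i = 0; i < 4; ++i) {
    page->push_back(static_cast<char>((n >> (8 * i)) & 0xff));
  }
  page->append(value);
}

// Decodes every string in the page, storing short ones inline.
DecodeResult DecodePlainPage(const char* page, size_t page_len) {
  DecodeResult result;
  size_t pos = 0;
  while (pos < page_len) {
    if (page_len - pos < 4) { result.ok = false; break; }
    uint32_t len = 0;
    for (int i = 0; i < 4; ++i) {
      len |= static_cast<uint32_t>(static_cast<unsigned char>(page[pos + i]))
             << (8 * i);
    }
    pos += 4;
    if (page_len - pos < len) { result.ok = false; break; }
    Slot slot;
    slot.len = len;
    if (len <= kInlineCapacity) {
      if (len > 0) std::memcpy(slot.inline_buf, page + pos, len);
      slot.is_small = true;
    } else {
      slot.ptr = page + pos;          // string keeps pointing into the page
      result.page_referenced = true;  // so the page must outlive the slot
    }
    result.slots.push_back(slot);
    pos += len;
  }
  return result;
}
```

A caller would keep the page buffer alive alongside the decoded rows only when page_referenced is true; otherwise the buffer can be recycled as soon as decoding finishes.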
With dictionary encoding, string column readers always allocate a copy
of the dictionary page so strings can point into it. If all strings are
smallified, this copy is freed after decoding all data, because no
strings point into it.
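The dictionary case can be sketched in the same spirit. The commit message does not say how the reader tracks whether strings point into the dictionary copy, so the copy_referenced_ flag below is an assumption, as are the DictStringReader class, its method names, and the 11-byte limit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

constexpr uint32_t kInlineCapacity = 11;  // assumed limit for this sketch

// Hypothetical string slot: either inline bytes or a pointer into a buffer.
struct Slot {
  bool is_small = false;
  uint32_t len = 0;
  const char* ptr = nullptr;
  char inline_buf[kInlineCapacity] = {};
  std::string ToString() const {
    return std::string(is_small ? inline_buf : ptr, len);
  }
};

// Hypothetical reader that owns a private copy of the dictionary bytes.
class DictStringReader {
 public:
  explicit DictStringReader(const std::vector<std::string>& dict_values) {
    size_t total = 0;
    for (const std::string& v : dict_values) total += v.size();
    // Stand-in for the copy of the dictionary page.
    dict_copy_.reset(new char[total]);
    size_t offset = 0;
    for (const std::string& v : dict_values) {
      if (!v.empty()) std::memcpy(dict_copy_.get() + offset, v.data(), v.size());
      Slot entry;
      entry.len = static_cast<uint32_t>(v.size());
      if (entry.len <= kInlineCapacity) {
        if (entry.len > 0) {
          std::memcpy(entry.inline_buf, dict_copy_.get() + offset, entry.len);
        }
        entry.is_small = true;
      } else {
        entry.ptr = dict_copy_.get() + offset;  // points into the copy
        copy_referenced_ = true;
      }
      entries_.push_back(entry);
      offset += v.size();
    }
  }

  // Returns the decoded string for a dictionary index (throws if out of range).
  const Slot& Decode(uint32_t index) const { return entries_.at(index); }

  // Called once all data has been decoded. Frees the dictionary copy if no
  // entry points into it. Returns true if the copy was freed.
  bool FinishDecoding() {
    if (copy_referenced_) return false;
    dict_copy_.reset();
    return true;
  }

  bool has_dict_copy() const { return dict_copy_ != nullptr; }

 private:
  std::unique_ptr<char[]> dict_copy_;
  std::vector<Slot> entries_;
  bool copy_referenced_ = false;
};
```

For a low-cardinality column of short values such as l_shipmode, every entry fits inline in this sketch, so FinishDecoding() releases the copy and the decoded strings remain valid.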
Measurements:
Measured time of a join query where lots of small strings are sent from
the reader:
  select l1.* from tpch_parquet.lineitem l1
    join tpch_parquet.lineitem l2 on l1.l_shipmode = l2.l_shipmode
    limit 1;

Before:
  KrpcDataStreamSender: SerializeBatchTime: 84.385ms
  HDFS_SCAN_NODE: MaterializeTupleTime: 8.183ms
After:
  KrpcDataStreamSender: SerializeBatchTime: 67.598ms
  HDFS_SCAN_NODE: MaterializeTupleTime: 8.632ms
Same measurement with a table with a higher scale factor:
  select l1.* from tpch30_parquet_snap.lineitem l1
    join tpch30_parquet_snap.lineitem l2 on l1.l_shipmode = l2.l_shipmode
    limit 1;

Before:
  KrpcDataStreamSender: SerializeBatchTime: 2s359ms
  HDFS_SCAN_NODE: MaterializeTupleTime: 239.267ms
After:
  KrpcDataStreamSender: SerializeBatchTime: 1s702ms
  HDFS_SCAN_NODE: MaterializeTupleTime: 243.606ms
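This is a ~28% reduction in SerializeBatchTime at the higher scale
factor (2s359ms to 1s702ms) and ~20% on the smaller table (84.385ms to
67.598ms). MaterializeTupleTime changes by only a few percent.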
Testing:
- Added a test to parquet-plain-test.cc verifying that small strings
  are smallified upon decoding.
Change-Id: I16c550d35cd6d3ec259b899b325611294137ccef
Reviewed-on: http://gerrit.cloudera.org:8080/24063
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Implement small string optimization in Parquet scanner
> ------------------------------------------------------
>
> Key: IMPALA-14794
> URL: https://issues.apache.org/jira/browse/IMPALA-14794
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Reporter: Csaba Ringhofer
> Assignee: Balazs Hevele
> Priority: Major
>