[ 
https://issues.apache.org/jira/browse/IMPALA-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18067185#comment-18067185
 ] 

ASF subversion and git services commented on IMPALA-14794:
----------------------------------------------------------

Commit 49f0ab1d09541f960cee12e9a5e6aa38ec21565a in impala's branch 
refs/heads/master from Balazs Hevele
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=49f0ab1d0 ]

IMPALA-14794: Implement small string optimization in Parquet scanner

When decoding a string from a Parquet file, the scanner now always
smallifies it (stores it inline in the string slot) if it is short
enough.
With plain encoding, if a column's data page contains only smallified
strings, the data page's memory is no longer attached to the tuple,
since no string points into it.
With dictionary encoding, string column readers always allocate a copy
of the dictionary page so that strings can point into it. If all
strings are smallified, this copy is freed after all data has been
decoded, because no string points into it.
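
To illustrate the plain-encoding case, here is a minimal sketch, not the
code from this commit. StringSlot, DecodePlainPage, AppendPlain and the
11-byte inline limit are hypothetical names and values chosen for the
example; they do not reproduce Impala's actual StringValue layout or
scanner API. The sketch assumes a little-endian host and relies only on
Parquet's plain encoding of BYTE_ARRAY values (a 4-byte little-endian
length followed by the bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical, simplified string slot. NOT Impala's real StringValue.
struct StringSlot {
  static constexpr uint32_t kSmallLimit = 11;  // assumed inline capacity
  bool is_small = false;
  uint32_t len = 0;
  const char* ptr = nullptr;         // used only when !is_small
  char inline_buf[kSmallLimit] = {}; // used only when is_small

  // Stores the bytes inline when they fit, otherwise points at 'src'.
  static StringSlot Decode(const char* src, uint32_t len) {
    StringSlot s;
    s.len = len;
    if (len <= kSmallLimit) {
      s.is_small = true;
      if (len > 0) std::memcpy(s.inline_buf, src, len);
    } else {
      s.ptr = src;  // the slot now depends on the lifetime of 'src'
    }
    return s;
  }

  std::string ToString() const {
    return std::string(is_small ? inline_buf : ptr, len);
  }
};

// Decodes a plain-encoded page of strings into 'out'.
// Returns true if at least one slot points into 'page', i.e. the page
// buffer must be kept alive for as long as the decoded slots are used.
bool DecodePlainPage(const std::vector<char>& page,
                     std::vector<StringSlot>* out) {
  bool needs_page = false;
  size_t pos = 0;
  while (pos + sizeof(uint32_t) <= page.size()) {
    uint32_t len;
    std::memcpy(&len, page.data() + pos, sizeof(len));
    pos += sizeof(len);
    if (len > page.size() - pos) break;  // truncated value: stop decoding
    StringSlot s = StringSlot::Decode(page.data() + pos, len);
    needs_page |= !s.is_small;
    out->push_back(s);
    pos += len;
  }
  return needs_page;
}

// Test helper: appends one plain-encoded string to 'page'.
void AppendPlain(std::vector<char>* page, const std::string& value) {
  uint32_t len = static_cast<uint32_t>(value.size());
  const char* len_bytes = reinterpret_cast<const char*>(&len);
  page->insert(page->end(), len_bytes, len_bytes + sizeof(len));
  page->insert(page->end(), value.begin(), value.end());
}
```

In this sketch, a page holding only short values such as "AIR" or
"REG AIR" makes DecodePlainPage return false, so a caller could release
the page buffer right after decoding. A single longer value makes it
return true, and the buffer has to stay attached.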

Measurements:
Timings for a join query in which the scanner emits many small
strings:
  select l1.* from tpch_parquet.lineitem l1
    join tpch_parquet.lineitem l2 on l1.l_shipmode = l2.l_shipmode
    limit 1;

Before:
  KrpcDataStreamSender: SerializeBatchTime: 84.385ms
  HDFS_SCAN_NODE: MaterializeTupleTime: 8.183ms
After:
  KrpcDataStreamSender: SerializeBatchTime: 67.598ms
  HDFS_SCAN_NODE: MaterializeTupleTime: 8.632ms

The same measurement on a table with a higher scale factor:
  select l1.* from tpch30_parquet_snap.lineitem l1
    join tpch30_parquet_snap.lineitem l2 on l1.l_shipmode = l2.l_shipmode
    limit 1;

Before:
  KrpcDataStreamSender: SerializeBatchTime: 2s359ms
  HDFS_SCAN_NODE: MaterializeTupleTime: 239.267ms
After:
  KrpcDataStreamSender: SerializeBatchTime: 1s702ms
  HDFS_SCAN_NODE: MaterializeTupleTime: 243.606ms

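This is a ~28% reduction in SerializeBatchTime at the higher scale
factor (~20% on the smaller table), while MaterializeTupleTime stays
roughly unchanged.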

Testing:
- Added a test to parquet-plain-test.cc that checks that small strings
  are smallified upon decoding

Change-Id: I16c550d35cd6d3ec259b899b325611294137ccef
Reviewed-on: http://gerrit.cloudera.org:8080/24063
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Implement small string optimization in Parquet scanner
> ------------------------------------------------------
>
>                 Key: IMPALA-14794
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14794
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Balazs Hevele
>            Priority: Major
>



