GitHub user nongli opened a pull request:
https://github.com/apache/spark/pull/11434
[SPARK-13574][SQL] Improve parquet decoding of dictionary encoded strings.
## What changes were proposed in this pull request?
Before this patch, decoding dictionary-encoded strings expanded the dictionary:
every decoded value created a new copy of its bytes in the columnar batch. With
this patch, the dictionary values are decoded once into the columnar batch, and
each data value then only records the length and offset of its dictionary entry.
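
The idea can be illustrated with a minimal, self-contained sketch. It is not the code from this patch and does not use Spark's column vector classes; the class `DictionaryStringColumn` and all of its fields and methods are hypothetical names chosen for illustration. It assumes the dictionary and the per-row dictionary ids are already available as arrays.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not Spark's actual column vector API.
final class DictionaryStringColumn {
    // Byte buffer that holds each distinct dictionary value exactly once.
    private final byte[] data;
    // Per-row (offset, length) pairs pointing into the shared buffer.
    private final int[] rowOffsets;
    private final int[] rowLengths;

    DictionaryStringColumn(String[] dictionary, int[] ids) {
        int[] dictOffsets = new int[dictionary.length];
        int[] dictLengths = new int[dictionary.length];
        byte[][] encoded = new byte[dictionary.length][];

        // Step 1: decode every dictionary value once and lay it out contiguously.
        int total = 0;
        for (int i = 0; i < dictionary.length; i++) {
            encoded[i] = dictionary[i].getBytes(StandardCharsets.UTF_8);
            dictOffsets[i] = total;
            dictLengths[i] = encoded[i].length;
            total += encoded[i].length;
        }
        data = new byte[total];
        for (int i = 0; i < dictionary.length; i++) {
            System.arraycopy(encoded[i], 0, data, dictOffsets[i], dictLengths[i]);
        }

        // Step 2: for each row, store only the offset and length of its
        // dictionary entry; no string bytes are copied per row.
        rowOffsets = new int[ids.length];
        rowLengths = new int[ids.length];
        for (int row = 0; row < ids.length; row++) {
            rowOffsets[row] = dictOffsets[ids[row]];
            rowLengths[row] = dictLengths[ids[row]];
        }
    }

    // Materializes the string for a row from the shared buffer.
    String get(int row) {
        return new String(data, rowOffsets[row], rowLengths[row], StandardCharsets.UTF_8);
    }

    // Number of bytes stored for string data, independent of the row count.
    int dataSize() {
        return data.length;
    }
}
```

In this sketch the string buffer grows with the size of the dictionary rather than with the number of rows, which is the property the description above relies on.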
## How was this patch tested?
Results:
```
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
String Dictionary:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------
SQL Parquet Vectorized                    481 /  503         21.8          45.9
SQL Parquet Vectorized (Before)           692 /  746         15.2          66.0
SQL Parquet MR                           1097 / 1273          9.6         104.6
```
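
For reading the table, the relative improvement can be derived from the best times alone. The snippet below is a small sketch of that arithmetic; the class and method names are hypothetical, and only the millisecond figures come from the results above.

```java
// Hypothetical helper for interpreting the benchmark table above.
final class BenchmarkSpeedup {
    // Ratio of a baseline time to an improved time; values above 1 mean faster.
    static double speedup(double baselineMs, double improvedMs) {
        return baselineMs / improvedMs;
    }

    public static void main(String[] args) {
        double vectorized = 481.0;       // SQL Parquet Vectorized, best time (ms)
        double vectorizedBefore = 692.0; // SQL Parquet Vectorized (Before), best time (ms)
        double parquetMr = 1097.0;       // SQL Parquet MR, best time (ms)

        System.out.printf("vs. before: %.2fx%n", speedup(vectorizedBefore, vectorized));
        System.out.printf("vs. MR:     %.2fx%n", speedup(parquetMr, vectorized));
    }
}
```

On best times this works out to roughly 1.4x over the previous vectorized reader and roughly 2.3x over the MR reader.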
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nongli/spark spark-13574
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11434.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11434
----
commit e6394139ec695dfd0a3467f977a220848fce8588
Author: Nong Li <[email protected]>
Date: 2016-02-28T01:32:41Z
[SPARK-13574][SQL] Improve parquet decoding of dictionary encoded strings.
----