[
https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282790#comment-14282790
]
MIchael Davies edited comment on SPARK-5309 at 1/19/15 6:15 PM:
----------------------------------------------------------------
Looking at the Parquet code, it looks like hooks are already in place to
reduce conversion overhead by taking advantage of dictionaries.
In particular, PrimitiveConverter has the methods hasDictionarySupport and
addValueFromDictionary for this purpose. CatalystPrimitiveConverter does not
use them.
I will put together a PR covering this, as the query performance savings can
be substantial.
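
To illustrate the idea, here is a minimal, self-contained sketch of a converter that decodes each dictionary entry once and then serves values by dictionary id. It does not use the Parquet library: SketchDictionary and SketchPrimitiveConverter are hypothetical stand-ins that only mirror the shape of the hooks named above (setDictionary is assumed to be the companion hook that hands the dictionary to the converter), and decodeCount exists purely to make the saving visible. It is not the code from the planned PR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a Parquet dictionary page: id -> raw UTF-8 bytes.
final class SketchDictionary {
    private final byte[][] entries;

    SketchDictionary(String... values) {
        entries = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            entries[i] = values[i].getBytes(StandardCharsets.UTF_8);
        }
    }

    byte[] decodeToBytes(int id) { return entries[id]; }

    int size() { return entries.length; }
}

// Hypothetical stand-in for the relevant subset of PrimitiveConverter.
abstract class SketchPrimitiveConverter {
    boolean hasDictionarySupport() { return false; }

    void setDictionary(SketchDictionary dictionary) {
        throw new UnsupportedOperationException();
    }

    void addValueFromDictionary(int dictionaryId) {
        throw new UnsupportedOperationException();
    }

    abstract void addBinary(byte[] value);
}

// Converter that opts in to dictionary support for String columns.
final class DictionaryAwareStringConverter extends SketchPrimitiveConverter {
    private String[] decoded;   // one String per dictionary entry
    private String current;     // most recently delivered value
    int decodeCount = 0;        // number of byte[] -> String conversions

    @Override
    boolean hasDictionarySupport() { return true; }

    @Override
    void setDictionary(SketchDictionary dictionary) {
        // Decode every dictionary entry exactly once, up front.
        decoded = new String[dictionary.size()];
        for (int i = 0; i < decoded.length; i++) {
            decoded[i] = decode(dictionary.decodeToBytes(i));
        }
    }

    @Override
    void addValueFromDictionary(int dictionaryId) {
        // No conversion here: reuse the already decoded String.
        current = decoded[dictionaryId];
    }

    @Override
    void addBinary(byte[] value) {
        // Fallback for values that are not dictionary encoded.
        current = decode(value);
    }

    String current() { return current; }

    private String decode(byte[] bytes) {
        decodeCount++;
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

With a two-entry dictionary, reading any number of rows through addValueFromDictionary leaves decodeCount at 2, whereas going through addBinary for every row would convert once per row.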
was (Author: michael davies):
Looking at Parquet code - it looks like hooks are already in place to support
this.
In particular PrimitiveConverter has methods hasDictionarySupport and
addValueFromDictionary for this purpose.
These are not used by CatalystPrimitiveConverter.
> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ---------------------------------------------------------------------------
>
> Key: SPARK-5309
> URL: https://issues.apache.org/jira/browse/SPARK-5309
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: MIchael Davies
> Priority: Minor
>
> Converting between Parquet Binary and Java Strings can form a significant
> proportion of query times.
> For columns with repeated String values (which is common), the same
> Binary will be converted repeatedly.
> A simple change to cache the last converted String per column was shown to
> reduce query times by 25% when grouping on a data set of 66M rows on a column
> with many repeated Strings.
> A possible optimisation would be to hand responsibility for Binary
> encoding/decoding over to Parquet so that it could ensure that this was done
> only once per Binary value.
> Next step is to look at Parquet code and to discuss with that project, which
> I will do.
> More details are available on this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html
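
The "cache the last converted String per column" change mentioned in the issue description could look roughly like the sketch below. This is an assumption about its shape, not the actual change that was measured: LastValueStringCache and decodeCount are hypothetical names, raw byte arrays stand in for Parquet's Binary, and one instance is assumed to exist per column.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical per-column cache: remembers the last byte sequence it
// converted and reuses the resulting String when the same bytes recur.
final class LastValueStringCache {
    private byte[] lastBytes;   // copy of the most recently converted input
    private String lastString;  // its decoded form
    int decodeCount = 0;        // number of real byte[] -> String conversions

    String convert(byte[] bytes) {
        if (lastBytes == null || !Arrays.equals(lastBytes, bytes)) {
            // Defensive copy: the caller may reuse its buffer for the next row.
            lastBytes = bytes.clone();
            lastString = new String(bytes, StandardCharsets.UTF_8);
            decodeCount++;
        }
        return lastString;
    }
}
```

This only helps when equal values arrive consecutively within a column, and the byte comparison is itself linear in the value length, so the gain comes from avoiding UTF-8 decoding and String allocation rather than from avoiding all per-row work.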