[
https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282263#comment-14282263
]
Michael Davies edited comment on SPARK-5309 at 1/19/15 8:35 AM:
----------------------------------------------------------------
I also noticed that predicates pushed down to Parquet are evaluated roughly
as follows:
{code}
getNextRow:
  loop {
    read the entire row, applying any Binary->String conversions
      (some predicate calculations are nested in this)
    if the predicate fails, continue with the next row
    otherwise return the row
  }
{code}
For filters on columns whose values change slowly this is inefficient: the
same value is converted and tested again for every row.
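As a rough illustration of the point about slowly changing columns, the sketch below remembers the predicate result for the most recent raw value, so that a run of identical values is decoded and tested only once. Everything here is an assumption made for illustration: `LastValuePredicate` is a hypothetical class, a plain `byte[]` stands in for a Parquet Binary, and this is not how Spark or Parquet actually evaluate pushed-down filters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Predicate;

// Hypothetical helper (not part of Spark or Parquet): remembers the outcome of
// a String predicate for the most recently seen raw column value.
final class LastValuePredicate {
    private final Predicate<String> predicate;
    private byte[] lastBytes;   // raw bytes of the previous value; null before the first call
    private boolean lastResult; // predicate outcome for lastBytes
    int evaluations = 0;        // number of real decode + predicate evaluations

    LastValuePredicate(Predicate<String> predicate) {
        this.predicate = predicate;
    }

    // byte[] stands in for a Parquet Binary in this sketch.
    boolean test(byte[] rawValue) {
        if (lastBytes != null && Arrays.equals(lastBytes, rawValue)) {
            return lastResult; // same value as the previous row: skip decode and test
        }
        String decoded = new String(rawValue, StandardCharsets.UTF_8);
        lastResult = predicate.test(decoded);
        lastBytes = rawValue.clone(); // copy, in case the caller reuses its buffer
        evaluations++;
        return lastResult;
    }
}
```

With such a wrapper, a column that holds the same value for many consecutive rows would cost one byte comparison per row instead of one conversion plus one predicate call per row. Whether that is a net win depends on how long the runs of equal values are.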
> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ---------------------------------------------------------------------------
>
> Key: SPARK-5309
> URL: https://issues.apache.org/jira/browse/SPARK-5309
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: Michael Davies
> Priority: Minor
>
> Converting between Parquet Binary and Java Strings can form a significant
> proportion of query times.
> For columns with repeated String values (which is common), the same
> Binary is converted repeatedly.
> A simple change to cache the last converted String per column was shown to
> reduce query times by 25% when grouping on a data set of 66M rows on a column
> with many repeated Strings.
> A possible optimisation would be to hand responsibility for Binary
> encoding/decoding over to Parquet so that it could ensure that this was done
> only once per Binary value.
> The next step is to look at the Parquet code and discuss this with that
> project, which I will do.
> More details are available in this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html
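The "cache the last converted String per column" idea from the description can be sketched as follows. This is a minimal illustration under stated assumptions, not the change that was actually measured: `CachingStringConverter` is a hypothetical name, a plain `byte[]` stands in for a Parquet Binary, and UTF-8 is assumed as the string encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical per-column converter (one instance per String column):
// reuses the previously decoded String when the raw bytes have not changed.
final class CachingStringConverter {
    private byte[] lastBytes;  // raw bytes of the last converted value; null before first use
    private String lastString; // String decoded from lastBytes
    int decodes = 0;           // number of real byte[] -> String conversions

    // byte[] stands in for a Parquet Binary in this sketch.
    String convert(byte[] raw) {
        if (lastBytes == null || !Arrays.equals(lastBytes, raw)) {
            lastString = new String(raw, StandardCharsets.UTF_8);
            lastBytes = raw.clone(); // copy, in case the caller reuses its buffer
            decodes++;
        }
        return lastString; // the same String instance for consecutive repeats
    }
}
```

Only consecutive repeats benefit from a single-entry cache like this one. A column whose repeated values are interleaved would need a larger cache, or the deduplication on the Parquet side that the description proposes.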