Andrew Ash created SPARK-21220:
----------------------------------
Summary: Use outputPartitioning's bucketing if possible on write
Key: SPARK-21220
URL: https://issues.apache.org/jira/browse/SPARK-21220
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash
When reading a bucketed dataset and writing it back with no transformations (a
copy), the bucketing information is lost and the user must re-specify it on
write. This hurts read performance on the copied dataset, since the bucketing
information enables significant optimizations (e.g. avoiding shuffles in joins
and aggregations on the bucketed columns) that aren't possible on the
un-bucketed copied table.
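A minimal reproduction of the problem (a sketch assuming a running {{SparkSession}} named {{spark}}; the table names and bucket count are illustrative):

{code:scala}
// Write a bucketed table.
spark.range(100)
  .withColumnRenamed("id", "key")
  .write
  .bucketBy(8, "key")
  .sortBy("key")
  .saveAsTable("source_bucketed")

// A straight copy: the rows are still physically clustered by "key",
// but the bucket metadata is dropped, so later joins on "key" shuffle.
spark.table("source_bucketed")
  .write
  .saveAsTable("copied")

// Today the user must repeat the bucket spec by hand to preserve it:
spark.table("source_bucketed")
  .write
  .bucketBy(8, "key")
  .sortBy("key")
  .saveAsTable("copied_bucketed")
{code}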
Spark should propagate this bucketing information for copied datasets, and more
generally could support inferring bucket information based on the known
partitioning of the final RDD at save time when that partitioning is a
{{HashPartitioning}}.
https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L118
In the {{bucketIdExpression}} linked above, we could {{.orElse}} a bucket
expression based on {{outputPartitioning}}s that are {{HashPartitioning}}.
This preserves bucket information for bucketed datasets, and also supports
saving this metadata at write time for datasets with a known partitioning.
Both of these cases should improve performance at read time of the
newly-written dataset.
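A rough sketch of the {{.orElse}} idea (not actual Spark code; the surrounding names {{dataColumns}} and {{queryExecution}} follow the linked 2.2.0 source, but this has not been compiled):

{code:scala}
// Existing behavior: a bucket id expression only when the user gave a spec.
val bucketIdExpression = bucketSpec.map { spec =>
  val bucketColumns =
    spec.bucketColumnNames.map(c => dataColumns.find(_.name == c).get)
  HashPartitioning(bucketColumns, spec.numBuckets).partitionIdExpression
}.orElse {
  // Proposed fallback: if the plan's output partitioning is already a
  // HashPartitioning, reuse it as the bucket id expression so the
  // bucketing metadata survives the write.
  queryExecution.executedPlan.outputPartitioning match {
    case hp: HashPartitioning => Some(hp.partitionIdExpression)
    case _ => None
  }
}
{code}

{{HashPartitioning.partitionIdExpression}} is already {{Pmod(Murmur3Hash(expressions), numBuckets)}}, which matches how bucket ids are computed for user-specified bucket specs, so the fallback would produce the same physical layout.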
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)