[
https://issues.apache.org/jira/browse/SPARK-21220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-21220:
---------------------------------
Labels: bulk-closed (was: )
> Use outputPartitioning's bucketing if possible on write
> -------------------------------------------------------
>
> Key: SPARK-21220
> URL: https://issues.apache.org/jira/browse/SPARK-21220
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Andrew Ash
> Priority: Major
> Labels: bulk-closed
>
> When reading a bucketed dataset and writing it back with no transformations
> (a copy), the bucketing information is lost and the user must re-specify it
> on write. This hurts read performance on the copied dataset, since the
> bucketing information enables significant optimizations that aren't possible
> on the un-bucketed copy.
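A minimal sketch of the problem described above, assuming hypothetical table names ("src", "copy_unbucketed", "copy_bucketed") and an existing SparkSession; the bucket count and column are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-copy").getOrCreate()

// Suppose "src" was originally written with .bucketBy(8, "id").
// A plain copy today silently drops that bucketing metadata:
spark.table("src").write.saveAsTable("copy_unbucketed")

// To keep the bucketing, the user currently has to restate it by hand,
// even though the data is already laid out in those buckets:
spark.table("src").write
  .bucketBy(8, "id")   // must be repeated manually on every copy
  .saveAsTable("copy_bucketed")
```

This is the re-specification burden the issue proposes to remove by propagating the known partitioning at write time.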
> Spark should propagate this bucketing information for copied datasets, and
> more generally could infer bucket information from the known partitioning of
> the final RDD at save time, whenever that partitioning is a
> {{HashPartitioning}}.
> https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L118
> In {{bucketIdExpression}} linked above, we could {{.orElse}} a bucket
> expression derived from an {{outputPartitioning}} that is a
> {{HashPartitioning}}.
> This preserves bucket information for bucketed datasets, and also supports
> saving this metadata at write time for datasets with a known partitioning.
> Both of these cases should improve read performance of the newly-written
> dataset.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)