[
https://issues.apache.org/jira/browse/SPARK-21220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-21220:
---------------------------------
Labels: bulk-closed (was: )
> Use outputPartitioning's bucketing if possible on write
> -------------------------------------------------------
>
> Key: SPARK-21220
> URL: https://issues.apache.org/jira/browse/SPARK-21220
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Andrew Ash
> Priority: Major
> Labels: bulk-closed
>
> When reading a bucketed dataset and writing it back with no transformations
> (a copy), the bucketing information is lost and the user must re-specify it
> on write. This hurts read performance on the copied dataset, since the
> bucketing information enables significant optimizations that aren't possible
> on the un-bucketed copy.
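A minimal sketch of the problem described above, assuming hypothetical table names ("src", "copy_unbucketed", "copy_bucketed") and an existing SparkSession; the bucket count and column are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-copy").getOrCreate()

// Suppose "src" was originally written with .bucketBy(8, "id").
// A plain copy today silently drops that bucketing metadata:
spark.table("src").write.saveAsTable("copy_unbucketed")

// To keep the bucketing, the user currently has to restate it by hand,
// even though the data is already laid out in those buckets:
spark.table("src").write
  .bucketBy(8, "id")   // must be repeated manually on every copy
  .saveAsTable("copy_bucketed")
```

This is the re-specification burden the issue proposes to remove by propagating the known partitioning at write time.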
> Spark should propagate this bucketing information for copied datasets, and
> more generally could infer bucket information from the known partitioning of
> the final RDD at save time, whenever that partitioning is a
> {{HashPartitioning}}.
> https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L118
> In {{bucketIdExpression}} linked above, we could {{.orElse}} a bucket
> expression derived from an {{outputPartitioning}} that is a
> {{HashPartitioning}}.
> This preserves bucket information for bucketed datasets, and also supports
> saving this metadata at write time for datasets with a known partitioning.
> Both of these cases should improve read performance of the newly-written
> dataset.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)