Andrew Ash created SPARK-21220:
----------------------------------

             Summary: Use outputPartitioning's bucketing if possible on write
                 Key: SPARK-21220
                 URL: https://issues.apache.org/jira/browse/SPARK-21220
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Andrew Ash


When reading a bucketed dataset and writing it back with no transformations (a 
copy), the bucketing information is lost and the user is required to re-specify 
it on write.  This hurts read performance on the copied dataset, since the 
bucketing information enables significant optimizations that aren't possible 
on the un-bucketed copy.

Spark should propagate this bucketing information for copied datasets, and more 
generally could support inferring bucket information based on the known 
partitioning of the final RDD at save time when that partitioning is a 
{{HashPartitioning}}.
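As a rough illustration of what "inferring bucket information from a hash partitioning" means: the bucket id of a row is just the row's hash taken positive-modulo the bucket count.  The sketch below is simplified and self-contained (Spark actually uses a Murmur3 hash via {{HashPartitioning.partitionIdExpression}}, not {{hashCode}}):

```scala
// Simplified sketch of bucket-id derivation under hash partitioning.
// Spark itself hashes with Murmur3 rather than hashCode; this only
// models the pmod(hash(key), numBuckets) shape.
def bucketId(key: Any, numBuckets: Int): Int = {
  val h = key.hashCode
  ((h % numBuckets) + numBuckets) % numBuckets  // positive modulo
}
```

Any two rows with the same key land in the same bucket, which is exactly the property the reader-side optimizations rely on.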

https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L118

In the above linked {{bucketIdExpression}}, we could {{.orElse}} a bucket 
expression based on an {{outputPartitioning}} that is a {{HashPartitioning}}.
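Sketched against a simplified, self-contained model (the name {{inferBucketSpec}} and the case classes below are hypothetical stand-ins for illustration, not Spark's actual internals), the fallback could look like:

```scala
// Hypothetical model of the proposed fallback: an explicitly requested
// bucket spec wins; otherwise, if the plan's output partitioning is a
// hash partitioning, derive a bucket spec from it.
case class BucketSpec(numBuckets: Int, columns: Seq[String])

sealed trait Partitioning
case class HashPartitioning(columns: Seq[String], numPartitions: Int) extends Partitioning
case object UnknownPartitioning extends Partitioning

def inferBucketSpec(
    explicitSpec: Option[BucketSpec],
    outputPartitioning: Partitioning): Option[BucketSpec] =
  explicitSpec.orElse {
    outputPartitioning match {
      case HashPartitioning(cols, n) => Some(BucketSpec(n, cols))
      case _                         => None
    }
  }
```

With no explicit spec, a dataset whose final RDD is hash-partitioned on {{id}} into 8 partitions would yield {{Some(BucketSpec(8, Seq("id")))}}; an explicitly specified bucket spec still takes precedence.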

This preserves bucket information for bucketed datasets, and also supports 
saving this metadata at write time for datasets with a known partitioning.  
Both of these cases should improve read performance of the newly-written 
dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
