[ 
https://issues.apache.org/jira/browse/SPARK-32859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32859:
------------------------------------

    Assignee:     (was: Apache Spark)

> Introduce SQL physical plan rule to decide enable/disable bucketing 
> --------------------------------------------------------------------
>
>                 Key: SPARK-32859
>                 URL: https://issues.apache.org/jira/browse/SPARK-32859
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Cheng Su
>            Priority: Minor
>
> Following an offline discussion with [~cloud_fan]: it would be better to 
> enable or disable SQL bucketing automatically based on the query plan. Currently 
> bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), 
> so every bucketed table in the query plan uses a bucketed table scan 
> (all input files of a bucket are read by the same task). The drawback is 
> that when the bucketed table scan brings no benefit (no join/group-by/etc. 
> in the query), it still restricts the number of tasks to the number of 
> buckets and can hurt parallelism.
>  
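To make the parallelism concern concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not Spark's actual split-planning logic: the function name `scan_task_count` and its parameters are hypothetical, and the non-bucketed case simply divides the total size by a maximum split size (playing the role of `spark.sql.files.maxPartitionBytes`), ignoring file boundaries and file-open costs.

```python
import math


def scan_task_count(total_bytes, num_buckets, max_partition_bytes, use_bucket_scan):
    """Approximate number of scan tasks for a bucketed table (toy model)."""
    if use_bucket_scan:
        # Bucketed scan: all files of one bucket go to the same task,
        # so the task count is capped at the number of buckets.
        return num_buckets
    # Non-bucketed scan (simplified): input is split purely by size.
    return math.ceil(total_bytes / max_partition_bytes)


GIB = 1024 ** 3
MIB = 1024 ** 2

# A 64 GiB table written with 8 buckets, split size of 128 MiB.
bucketed_tasks = scan_task_count(64 * GIB, 8, 128 * MIB, use_bucket_scan=True)
plain_tasks = scan_task_count(64 * GIB, 8, 128 * MIB, use_bucket_scan=False)
print(bucketed_tasks, plain_tasks)
```

Under these assumptions a query with no join or aggregation on the bucket columns would run the scan with 8 tasks instead of 512, which is the case the proposed rule is meant to avoid.
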
> The proposed change is to introduce a physical plan rule (applied right 
> before `EnsureRequirements`):
> (1) transformUp() the physical plan, matching each SparkPlan operator that 
> is a FileSourceScanExec; if optionalBucketSet is set, enable bucket scan 
> (the bucket filter case).
> (2) transformUp() the physical plan, matching each SparkPlan operator that 
> is a SparkPlanWithInterestingPartitioning.
> SparkPlanWithInterestingPartitioning: the plan is one of {SortMergeJoinExec, 
> ShuffledHashJoinExec, HashAggregateExec, ObjectHashAggregateExec, 
> SortAggregateExec, etc., which have 
> HashClusteredDistribution/ClusteredDistribution in 
> requiredChildDistribution}, and its requiredChildDistribution is a 
> HashClusteredDistribution/ClusteredDistribution on its underlying 
> FileSourceScanExec's bucketed columns.
> (3) For any child of a SparkPlanWithInterestingPartitioning that does not 
> satisfy the plan's requiredChildDistribution, go through the child's 
> sub-query plan tree and check that:
>  (3.1) every node's outputPartitioning is the same as its child's, and every 
> node's requiredChildDistribution is UnspecifiedDistribution;
>  (3.2) the leaf node is a FileSourceScanExec on a bucketed table; and
>  (3.3) if bucket scan is enabled for this FileSourceScanExec, the 
> outputPartitioning of the FileSourceScanExec satisfies the 
> requiredChildDistribution of the SparkPlanWithInterestingPartitioning.
>  If (3.1), (3.2), and (3.3) all hold, enable bucket scan for this 
> FileSourceScanExec, then double-check that the new child of the 
> SparkPlanWithInterestingPartitioning satisfies requiredChildDistribution.
>  
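The three steps above can be sketched on a toy plan tree. This is only an illustration in Python, not Spark code: the `Plan` class, its fields, and the helpers `satisfies`, `find_scan`, and `decide_bucket_scan` are all hypothetical stand-ins. In particular, `required_cluster_keys` models a ClusteredDistribution-style requirement shared by all children (a real join has separate keys per side), `has_bucket_filter` stands in for optionalBucketSet, and the final "double check" of step (3) is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Plan:
    name: str
    children: List["Plan"] = field(default_factory=list)
    # Keys the children must be clustered on; None models UnspecifiedDistribution.
    required_cluster_keys: Optional[Tuple[str, ...]] = None
    # True when the operator passes its child's partitioning through unchanged.
    preserves_partitioning: bool = True
    # Scan-only fields (leaf nodes).
    bucket_columns: Optional[Tuple[str, ...]] = None
    has_bucket_filter: bool = False
    use_bucket_scan: bool = False


def satisfies(bucket_columns, required_keys):
    # Simplified check: bucket columns must all appear among the required keys.
    return bool(bucket_columns) and set(bucket_columns) <= set(required_keys)


def find_scan(node):
    """Follow a single-child chain (3.1) down to a bucketed scan (3.2), else None."""
    while node.children:
        if (len(node.children) != 1
                or node.required_cluster_keys is not None
                or not node.preserves_partitioning):
            return None
        node = node.children[0]
    return node if node.bucket_columns else None


def decide_bucket_scan(plan):
    # Visit children first, mirroring the bottom-up order of transformUp().
    for child in plan.children:
        decide_bucket_scan(child)
    if not plan.children:
        # Step (1): a bucket filter alone is enough to keep the bucket scan.
        if plan.bucket_columns and plan.has_bucket_filter:
            plan.use_bucket_scan = True
    elif plan.required_cluster_keys is not None:
        # Steps (2) and (3): operator with an "interesting partitioning".
        for child in plan.children:
            scan = find_scan(child)
            # Step (3.3): would the bucketed output meet the requirement?
            if scan is not None and satisfies(scan.bucket_columns,
                                              plan.required_cluster_keys):
                scan.use_bucket_scan = True
    return plan


# A join on the bucket column: both scans should use bucket scan.
orders = Plan("Scan orders", bucket_columns=("user_id",))
users = Plan("Scan users", bucket_columns=("user_id",))
join = Plan("SortMergeJoin", [Plan("Filter", [orders]), users],
            required_cluster_keys=("user_id",))
decide_bucket_scan(join)

# A plain projection: nothing benefits from bucketing, so it stays disabled.
logs = Plan("Scan logs", bucket_columns=("user_id",))
decide_bucket_scan(Plan("Project", [logs]))

print(orders.use_bucket_scan, users.use_bucket_scan, logs.use_bucket_scan)
```

In this toy model the default is "bucket scan off" and the rule switches it on only where a bucket filter or an operator with a matching clustering requirement is found, which is one way to read the proposal.
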
> The idea of SparkPlanWithInterestingPartitioning is inspired by the 
> "interesting order" concept in "Access Path Selection in a Relational 
> Database Management System" 
> ([http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

