[ https://issues.apache.org/jira/browse/SPARK-32859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-32859:
------------------------------------

    Assignee:     (was: Apache Spark)

> Introduce SQL physical plan rule to decide enable/disable bucketing
> --------------------------------------------------------------------
>
>                 Key: SPARK-32859
>                 URL: https://issues.apache.org/jira/browse/SPARK-32859
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Cheng Su
>            Priority: Minor
>
> Discussed with [~cloud_fan] offline: it would be better to decide
> automatically, based on the query plan, whether to enable or disable SQL
> bucketing. Currently bucketing is enabled by default
> (`spark.sql.sources.bucketing.enabled`=true), so for every bucketed table in
> the query plan we use a bucket table scan (all input files of a bucket are
> read by the same task). The drawback is that when the bucket table scan
> brings no benefit (no join/group-by/etc. in the query), it still restricts
> the number of tasks to the number of buckets, which might hurt parallelism.
>
> The proposed change is to introduce a physical plan rule (right before
> `ensureRequirements`):
> (1). transformUp() the physical plan, matching each SparkPlan operator that
> is a FileSourceScanExec. If optionalBucketSet is set, enable bucket scan
> (bucket filter in this case).
> (2). transformUp() the physical plan, matching each SparkPlan operator that
> is a SparkPlanWithInterestingPartitioning.
> SparkPlanWithInterestingPartitioning: the plan is in {SortMergeJoinExec,
> ShuffledHashJoinExec, HashAggregateExec, ObjectHashAggregateExec,
> SortAggregateExec, etc., which have
> HashClusteredDistribution/ClusteredDistribution in
> requiredChildDistribution}, and its required
> HashClusteredDistribution/ClusteredDistribution is on the bucketed columns
> of its underlying FileSourceScanExec.
> (3). For any child of a SparkPlanWithInterestingPartitioning that does not
> satisfy the plan's requiredChildDistribution, go through the child's
> sub-plan tree:
> if (3.1). every node's outputPartitioning is the same as the child's, and
> every node's requiredChildDistribution is UnspecifiedDistribution,
> and (3.2). the leaf node is a FileSourceScanExec on a bucketed table,
> and (3.3). with bucket scan enabled for this FileSourceScanExec, the
> outputPartitioning of the FileSourceScanExec satisfies the
> requiredChildDistribution of the SparkPlanWithInterestingPartitioning.
> If (3.1), (3.2), and (3.3) are all true, enable bucket scan for this
> FileSourceScanExec, and double-check that the new child of the
> SparkPlanWithInterestingPartitioning satisfies requiredChildDistribution.
>
> The idea of SparkPlanWithInterestingPartitioning is inspired by
> "interesting order" in "Access Path Selection in a Relational Database
> Management
> System"([http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf]).
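
Steps (1) to (3) of the proposal can be sketched on a toy plan tree. This is only an illustrative model in plain Python, not Spark code: `Scan`, `Op`, `satisfies`, `find_scan_through_passthrough`, and `decide_bucketing` are hypothetical names, and the "satisfies" check assumes ClusteredDistribution-style semantics (the bucket columns must be a non-empty subset of the required keys).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Scan:
    """Hypothetical stand-in for FileSourceScanExec."""
    table: str
    bucket_cols: Tuple[str, ...] = ()   # empty means the table is not bucketed
    has_bucket_filter: bool = False     # models optionalBucketSet being set
    use_bucket_scan: bool = False       # the decision made by the rule


@dataclass
class Op:
    """Hypothetical stand-in for any non-leaf SparkPlan operator."""
    name: str
    children: List[Union["Op", Scan]] = field(default_factory=list)
    # One entry per child: required clustering keys; missing/None = unspecified.
    required_keys: List[Optional[Tuple[str, ...]]] = field(default_factory=list)
    preserves_partitioning: bool = True  # e.g. a filter or projection


def required_for(op: Op, i: int) -> Optional[Tuple[str, ...]]:
    return op.required_keys[i] if i < len(op.required_keys) else None


def satisfies(bucket_cols, keys) -> bool:
    # Assumption: ClusteredDistribution-style check (subset of required keys).
    return bool(bucket_cols) and set(bucket_cols) <= set(keys)


def find_scan_through_passthrough(node) -> Optional[Scan]:
    """Steps 3.1/3.2: walk down pass-through operators to a bucketed scan."""
    while isinstance(node, Op):
        passthrough = (node.preserves_partitioning
                       and len(node.children) == 1
                       and required_for(node, 0) is None)
        if not passthrough:
            return None
        node = node.children[0]
    return node if node.bucket_cols else None


def decide_bucketing(node):
    """Bottom-up pass that turns on bucket scan only where it helps."""
    if isinstance(node, Scan):
        # Step 1: a bucket filter alone is enough to enable bucket scan.
        if node.bucket_cols and node.has_bucket_filter:
            node.use_bucket_scan = True
        return node
    for i, child in enumerate(node.children):
        decide_bucketing(child)
        keys = required_for(node, i)
        if keys is None:
            continue  # this operator has no interesting partitioning here
        # Steps 2 and 3: operator requires clustering on `keys` for this child.
        scan = find_scan_through_passthrough(child)
        if scan is not None and satisfies(scan.bucket_cols, keys):  # step 3.3
            scan.use_bucket_scan = True
    return node


# A join on the bucketed columns: both scans should use bucket scan.
orders = Scan("orders", bucket_cols=("user_id",))
users = Scan("users", bucket_cols=("id",))
join = Op("SortMergeJoin", [Op("Filter", [orders]), users],
          required_keys=[("user_id",), ("id",)])
decide_bucketing(join)

# A plain filter over a bucketed table: bucket scan brings nothing, stays off.
events = Scan("events", bucket_cols=("user_id",))
decide_bucketing(Op("Filter", [events]))
```

In this model `orders` and `users` end up with `use_bucket_scan` set, while `events` does not, which mirrors the motivation in the description: a query with no join or aggregation over the bucketed columns keeps its normal scan parallelism.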