sunchao commented on code in PR #56603:
URL: https://github.com/apache/spark/pull/56603#discussion_r3447588974
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##########
@@ -134,45 +134,63 @@ object PartitionPruning extends Rule[LogicalPlan] with
PredicateHelper with Join
* in bytes of the plan on the other side of the join. We estimate the
filtering ratio
* using column statistics if they are available, otherwise we use the
config value of
* `spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio`.
+ *
+ * The fallback ratio is only meaningful "when CBO stats are missing, but
there is a predicate
+ * that is likely to be selective" -- so it is used only when
`hasSelectivePredicate` is true. A
+ * filtering side that is eligible only because it is already materialized
(a LocalRelation or a
+ * checkpoint-derived LogicalRDD, SPARK-54593) carries no such predicate;
for it we rely solely on
+ * the statistics-based ratio and report no benefit when statistics are
unavailable, so it is not
+ * injected as a standalone always-applied subquery on a guessed ratio. A
statistics-based ratio,
+ * when available, is always honored regardless of `hasSelectivePredicate`.
*/
private def pruningHasBenefit(
partExpr: Expression,
partPlan: LogicalPlan,
otherExpr: Expression,
- otherPlan: LogicalPlan): Boolean = {
+ otherPlan: LogicalPlan,
+ hasSelectivePredicate: Boolean): Boolean = {
// get the distinct counts of an attribute for a given table
def distinctCounts(attr: Attribute, plan: LogicalPlan): Option[BigInt] = {
plan.stats.attributeStats.get(attr).flatMap(_.distinctCount)
}
- // the default filtering ratio when CBO stats are missing, but there is a
- // predicate that is likely to be selective
- val fallbackRatio = conf.dynamicPartitionPruningFallbackFilterRatio
- // the filtering ratio based on the type of the join condition and on the
column statistics
- val filterRatio = (partExpr.references.toList,
otherExpr.references.toList) match {
- // filter out expressions with more than one attribute on any side of
the operator
- case (leftAttr :: Nil, rightAttr :: Nil)
- if conf.dynamicPartitionPruningUseStats =>
- // get the CBO stats for each attribute in the join condition
- val partDistinctCount = distinctCounts(leftAttr, partPlan)
- val otherDistinctCount = distinctCounts(rightAttr, otherPlan)
- val availableStats = partDistinctCount.isDefined &&
partDistinctCount.get > 0 &&
- otherDistinctCount.isDefined
- if (!availableStats) {
- fallbackRatio
- } else if (partDistinctCount.get.toDouble <=
otherDistinctCount.get.toDouble) {
- // there is likely an estimation error, so we fallback
- fallbackRatio
- } else {
- 1 - otherDistinctCount.get.toDouble /
partDistinctCount.get.toDouble
- }
- case _ => fallbackRatio
+ // the filtering ratio derived from column statistics, when reliable stats
are available
+ val statsBasedRatio: Option[Double] =
+ (partExpr.references.toList, otherExpr.references.toList) match {
Review Comment:
[P2] Apply `maxRows` independently of the filtering key's reference count
The single-reference requirement on `otherExpr` prevents the new row-bound
path from handling expressions even though `maxRows` bounds the NDV of any
expression evaluated over those rows. For example, let the fact partition key
have NDV 20 and join it to a one-row `LocalRelation(x = 0, y = 0)` using
`fact.p = x + y`. The filtering-key NDV is provably at most 1, so the row bound
establishes at least 95% pruning. In a focused base-versus-head run with
`reuseBroadcastOnly=false` and no reusable broadcast, the base inserted
standalone DPP and scanned 1/20 partitions; this head inserted no DPP and
scanned all 20.
Could we require a single partition-side attribute for its NDV, but consult
a filtering-side column statistic only when available and otherwise use
`maxRows` regardless of `otherExpr.references.size`? The `maxRows` branch
should also be limited to a repeatable materialized plan, per the separate
correctness concern below.
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##########
@@ -134,45 +134,63 @@ object PartitionPruning extends Rule[LogicalPlan] with
PredicateHelper with Join
* in bytes of the plan on the other side of the join. We estimate the
filtering ratio
* using column statistics if they are available, otherwise we use the
config value of
* `spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio`.
+ *
+ * The fallback ratio is only meaningful "when CBO stats are missing, but
there is a predicate
+ * that is likely to be selective" -- so it is used only when
`hasSelectivePredicate` is true. A
+ * filtering side that is eligible only because it is already materialized
(a LocalRelation or a
+ * checkpoint-derived LogicalRDD, SPARK-54593) carries no such predicate;
for it we rely solely on
+ * the statistics-based ratio and report no benefit when statistics are
unavailable, so it is not
+ * injected as a standalone always-applied subquery on a guessed ratio. A
statistics-based ratio,
+ * when available, is always honored regardless of `hasSelectivePredicate`.
*/
private def pruningHasBenefit(
partExpr: Expression,
partPlan: LogicalPlan,
otherExpr: Expression,
- otherPlan: LogicalPlan): Boolean = {
+ otherPlan: LogicalPlan,
+ hasSelectivePredicate: Boolean): Boolean = {
// get the distinct counts of an attribute for a given table
def distinctCounts(attr: Attribute, plan: LogicalPlan): Option[BigInt] = {
plan.stats.attributeStats.get(attr).flatMap(_.distinctCount)
}
- // the default filtering ratio when CBO stats are missing, but there is a
- // predicate that is likely to be selective
- val fallbackRatio = conf.dynamicPartitionPruningFallbackFilterRatio
- // the filtering ratio based on the type of the join condition and on the
column statistics
- val filterRatio = (partExpr.references.toList,
otherExpr.references.toList) match {
- // filter out expressions with more than one attribute on any side of
the operator
- case (leftAttr :: Nil, rightAttr :: Nil)
- if conf.dynamicPartitionPruningUseStats =>
- // get the CBO stats for each attribute in the join condition
- val partDistinctCount = distinctCounts(leftAttr, partPlan)
- val otherDistinctCount = distinctCounts(rightAttr, otherPlan)
- val availableStats = partDistinctCount.isDefined &&
partDistinctCount.get > 0 &&
- otherDistinctCount.isDefined
- if (!availableStats) {
- fallbackRatio
- } else if (partDistinctCount.get.toDouble <=
otherDistinctCount.get.toDouble) {
- // there is likely an estimation error, so we fallback
- fallbackRatio
- } else {
- 1 - otherDistinctCount.get.toDouble /
partDistinctCount.get.toDouble
- }
- case _ => fallbackRatio
+ // the filtering ratio derived from column statistics, when reliable stats
are available
+ val statsBasedRatio: Option[Double] =
+ (partExpr.references.toList, otherExpr.references.toList) match {
+ // filter out expressions with more than one attribute on any side of
the operator
+ case (leftAttr :: Nil, rightAttr :: Nil)
+ if conf.dynamicPartitionPruningUseStats =>
+ // get the CBO stats for each attribute in the join condition
+ val partDistinctCount = distinctCounts(leftAttr, partPlan)
Review Comment:
[P2] Use the lineage-resolved partition attribute for the NDV lookup
`getFilterableTableScan` traces the pruning expression through projections
and aliases, but returns only the leaf and discards the resolved expression.
This method then pairs the original `partExpr` with that leaf. For
`fact.select($"p".as("alias_p"))`, `partExpr` references `alias_p#new`, while
the leaf's `AttributeMap` is keyed by `p#old`, so the valid fact-side NDV is
missed.
With fact NDV 20, filtering-side NDV 1, `reuseBroadcastOnly=false`, and no
reusable broadcast, the exact base scans 1/20 partitions through standalone
DPP, while this head scans 20/20 with no DPP. The lineage mismatch predates
this PR, but gating the fallback here turns that previously masked miss into a
regression. Could we carry the resolved expression from
`findExpressionAndTrackLineageDown` into this lookup and use its leaf attribute
when it is an `Attribute`?
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##########
@@ -134,45 +134,63 @@ object PartitionPruning extends Rule[LogicalPlan] with
PredicateHelper with Join
* in bytes of the plan on the other side of the join. We estimate the
filtering ratio
* using column statistics if they are available, otherwise we use the
config value of
* `spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio`.
+ *
+ * The fallback ratio is only meaningful "when CBO stats are missing, but
there is a predicate
+ * that is likely to be selective" -- so it is used only when
`hasSelectivePredicate` is true. A
+ * filtering side that is eligible only because it is already materialized
(a LocalRelation or a
+ * checkpoint-derived LogicalRDD, SPARK-54593) carries no such predicate;
for it we rely solely on
+ * the statistics-based ratio and report no benefit when statistics are
unavailable, so it is not
+ * injected as a standalone always-applied subquery on a guessed ratio. A
statistics-based ratio,
+ * when available, is always honored regardless of `hasSelectivePredicate`.
*/
private def pruningHasBenefit(
partExpr: Expression,
partPlan: LogicalPlan,
otherExpr: Expression,
- otherPlan: LogicalPlan): Boolean = {
+ otherPlan: LogicalPlan,
+ hasSelectivePredicate: Boolean): Boolean = {
// get the distinct counts of an attribute for a given table
def distinctCounts(attr: Attribute, plan: LogicalPlan): Option[BigInt] = {
plan.stats.attributeStats.get(attr).flatMap(_.distinctCount)
}
- // the default filtering ratio when CBO stats are missing, but there is a
- // predicate that is likely to be selective
- val fallbackRatio = conf.dynamicPartitionPruningFallbackFilterRatio
- // the filtering ratio based on the type of the join condition and on the
column statistics
- val filterRatio = (partExpr.references.toList,
otherExpr.references.toList) match {
- // filter out expressions with more than one attribute on any side of
the operator
- case (leftAttr :: Nil, rightAttr :: Nil)
- if conf.dynamicPartitionPruningUseStats =>
- // get the CBO stats for each attribute in the join condition
- val partDistinctCount = distinctCounts(leftAttr, partPlan)
- val otherDistinctCount = distinctCounts(rightAttr, otherPlan)
- val availableStats = partDistinctCount.isDefined &&
partDistinctCount.get > 0 &&
- otherDistinctCount.isDefined
- if (!availableStats) {
- fallbackRatio
- } else if (partDistinctCount.get.toDouble <=
otherDistinctCount.get.toDouble) {
- // there is likely an estimation error, so we fallback
- fallbackRatio
- } else {
- 1 - otherDistinctCount.get.toDouble /
partDistinctCount.get.toDouble
- }
- case _ => fallbackRatio
+ // the filtering ratio derived from column statistics, when reliable stats
are available
+ val statsBasedRatio: Option[Double] =
+ (partExpr.references.toList, otherExpr.references.toList) match {
+ // filter out expressions with more than one attribute on any side of
the operator
+ case (leftAttr :: Nil, rightAttr :: Nil)
+ if conf.dynamicPartitionPruningUseStats =>
+ // get the CBO stats for each attribute in the join condition
+ val partDistinctCount = distinctCounts(leftAttr, partPlan)
+ // A materialized filtering side (e.g. a LocalRelation) may carry
no column statistics
+ // but an exact `maxRows`, which is a conservative upper bound on
its join-key NDV. Use
+ // it when the column statistic is missing so a small, selective
materialized side still
+ // yields a statistics-based ratio rather than falling through to
the gated fallback.
+ val otherDistinctCount =
+ distinctCounts(rightAttr,
otherPlan).orElse(otherPlan.maxRows.map(BigInt(_)))
+ val availableStats = partDistinctCount.isDefined &&
partDistinctCount.get > 0 &&
+ otherDistinctCount.isDefined
+ if (!availableStats) {
+ None
+ } else if (partDistinctCount.get.toDouble <=
otherDistinctCount.get.toDouble) {
+ // there is likely an estimation error, so there is no reliable
stats-based ratio
+ None
+ } else {
+ Some(1 - otherDistinctCount.get.toDouble /
partDistinctCount.get.toDouble)
Review Comment:
[P2] Do not treat a referenced attribute's NDV as the transformed partition
key's NDV
A single referenced attribute does not mean `partExpr` preserves that
attribute's NDV. For example, create 20 fact partitions with distinct even
values of `p` and join `(fact.p % 2) = keys.k` to a one-row `LocalRelation(k =
0)`. This code uses `NDV(p) = 20` and `maxRows = 1`, estimates `1 - 1/20 = 95%`
pruning, and injects standalone DPP. In reality, `p % 2` is 0 for every
partition, so the head still scans all 20 partitions; the exact base with
`fallbackFilterRatio=0` injects no DPP.
The raw-attribute-versus-expression limitation predates this PR, but the new
`maxRows` path in `6d2c13c` activates it in a configuration where the base does
not. Could we use the partition-side column NDV only when the lineage-resolved
`partExpr` is itself an `Attribute`, unless expression-level NDV is available?
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##########
@@ -134,45 +134,63 @@ object PartitionPruning extends Rule[LogicalPlan] with
PredicateHelper with Join
* in bytes of the plan on the other side of the join. We estimate the
filtering ratio
* using column statistics if they are available, otherwise we use the
config value of
* `spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio`.
+ *
+ * The fallback ratio is only meaningful "when CBO stats are missing, but
there is a predicate
+ * that is likely to be selective" -- so it is used only when
`hasSelectivePredicate` is true. A
+ * filtering side that is eligible only because it is already materialized
(a LocalRelation or a
+ * checkpoint-derived LogicalRDD, SPARK-54593) carries no such predicate;
for it we rely solely on
+ * the statistics-based ratio and report no benefit when statistics are
unavailable, so it is not
+ * injected as a standalone always-applied subquery on a guessed ratio. A
statistics-based ratio,
+ * when available, is always honored regardless of `hasSelectivePredicate`.
*/
private def pruningHasBenefit(
partExpr: Expression,
partPlan: LogicalPlan,
otherExpr: Expression,
- otherPlan: LogicalPlan): Boolean = {
+ otherPlan: LogicalPlan,
+ hasSelectivePredicate: Boolean): Boolean = {
// get the distinct counts of an attribute for a given table
def distinctCounts(attr: Attribute, plan: LogicalPlan): Option[BigInt] = {
plan.stats.attributeStats.get(attr).flatMap(_.distinctCount)
}
- // the default filtering ratio when CBO stats are missing, but there is a
- // predicate that is likely to be selective
- val fallbackRatio = conf.dynamicPartitionPruningFallbackFilterRatio
- // the filtering ratio based on the type of the join condition and on the
column statistics
- val filterRatio = (partExpr.references.toList,
otherExpr.references.toList) match {
- // filter out expressions with more than one attribute on any side of
the operator
- case (leftAttr :: Nil, rightAttr :: Nil)
- if conf.dynamicPartitionPruningUseStats =>
- // get the CBO stats for each attribute in the join condition
- val partDistinctCount = distinctCounts(leftAttr, partPlan)
- val otherDistinctCount = distinctCounts(rightAttr, otherPlan)
- val availableStats = partDistinctCount.isDefined &&
partDistinctCount.get > 0 &&
- otherDistinctCount.isDefined
- if (!availableStats) {
- fallbackRatio
- } else if (partDistinctCount.get.toDouble <=
otherDistinctCount.get.toDouble) {
- // there is likely an estimation error, so we fallback
- fallbackRatio
- } else {
- 1 - otherDistinctCount.get.toDouble /
partDistinctCount.get.toDouble
- }
- case _ => fallbackRatio
+ // the filtering ratio derived from column statistics, when reliable stats
are available
+ val statsBasedRatio: Option[Double] =
+ (partExpr.references.toList, otherExpr.references.toList) match {
+ // filter out expressions with more than one attribute on any side of
the operator
+ case (leftAttr :: Nil, rightAttr :: Nil)
+ if conf.dynamicPartitionPruningUseStats =>
+ // get the CBO stats for each attribute in the join condition
+ val partDistinctCount = distinctCounts(leftAttr, partPlan)
+ // A materialized filtering side (e.g. a LocalRelation) may carry
no column statistics
+ // but an exact `maxRows`, which is a conservative upper bound on
its join-key NDV. Use
+ // it when the column statistic is missing so a small, selective
materialized side still
+ // yields a statistics-based ratio rather than falling through to
the gated fallback.
+ val otherDistinctCount =
+ distinctCounts(rightAttr,
otherPlan).orElse(otherPlan.maxRows.map(BigInt(_)))
Review Comment:
[P1] Restrict the `maxRows` fallback to repeatable filtering plans
This fallback is applied to every filtering plan, not only plans accepted by
`isRepeatableMaterializedPlan`. A selective `Filter` makes a side DPP-eligible
even when it contains stateful user code. For example: checkpointed `Seq(1)` ->
stateful `mapPartitions(counter.incrementAndGet)` -> `filter(p > 0)` ->
`limit(1)`. The limit supplies `maxRows = 1`. With stats enabled,
`fallbackFilterRatio=0`, `reuseBroadcastOnly=false`, and
broadcast/exchange/subquery reuse disabled, I get:
- base `90d1fbec`: no DPP, one evaluation, result `[1]`;
- head `8dba5590`: standalone DPP, two evaluations, result `[]`.
The DPP copy sees key 1 and prunes the fact table to partition 1; the join
copy then sees key 2 and needs the partition that was already removed. The
broader non-repeatable selective-plan safety gap exists on the base with a
nonzero fallback, but commit `6d2c13c` newly activates it for the previously
correct zero-fallback configuration through this unrestricted row bound. Could
we use `otherPlan.maxRows` only when `isRepeatableMaterializedPlan(otherPlan)`
(or another explicit repeatability proof) is true?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]