MaxGekk commented on code in PR #56502:
URL: https://github.com/apache/spark/pull/56502#discussion_r3413956079
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##########
@@ -436,6 +437,44 @@ object LogicalPlanIntegrity {
}
}
+ /**
+ * This method validates that no operator references one of its child's
output attributes with a
+ * narrower nullability than the child produces -- declaring the attribute
non-nullable while the
+ * child produces it as nullable. Such a mismatch (for example, an
outer-join-widened column that
+ * a synthesized operator references with its original non-nullable
metadata) is a latent
+ * plan-integrity defect: the rows are unchanged, so result-equivalence
reasoning and
+ * `validateSchemaOutput` (which compares types `asNullable`) both miss it,
yet a later
+ * optimization that trusts the non-nullable declaration may drop a null
check and produce wrong
+ * results. Declaring a reference *wider* (nullable over a non-nullable
child) is safe and is not
+ * flagged. References whose `ExprId` is not produced by any child (e.g.
outer references) are
+ * ignored.
+ *
+ * The check is disabled unless `spark.sql.planChangeValidation.nullability`
is set, because not
+ * all rules maintain precise nullability today; it is intended to be
enabled by targeted suites.
+ * Returns the error message if the check does not pass, or None otherwise.
+ */
+ def validateNullability(plan: LogicalPlan): Option[String] = {
+ if (!SQLConf.get.getConf(SQLConf.PLAN_CHANGE_VALIDATION_NULLABILITY)) {
+ return None
+ }
+ plan.collect {
+ case p if p.childrenResolved =>
+ val childNullable = p.children.flatMap(_.output).groupBy(_.exprId).map
{
Review Comment:
Minor: worth a one-line note that this is a strictly local
parent-vs-immediate-child check over `p.expressions`. A narrowing introduced at
a pure pass-through boundary (an operator that narrows an attribute it only
forwards via `output`, without re-referencing it in `expressions`) won't be
flagged at that node, and the parent then treats the already-narrowed value as
the child's truth. The targeted SPARK-56395 shape is caught because the
synthesized references appear in `expressions`, but making the scoping explicit
helps future readers.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##########
@@ -436,6 +437,44 @@ object LogicalPlanIntegrity {
}
}
+ /**
+ * This method validates that no operator references one of its child's
output attributes with a
Review Comment:
Could we make the scope limitations explicit in this scaladoc? The check
only compares **top-level** `AttributeReference.nullable` — nested nullability
(struct fields, array elements, map values) is not validated, so a child
producing `struct<a: long /* nullable */>` referenced as `struct<a: long /*
non-null */>` keeps the same top-level `nullable` and passes. This mirrors
`validateSchemaOutput` (which compares `asNullable`), so it's consistent, but
since that's exactly the "latent nullability defect" class this check targets,
it's worth calling out. Same for subquery plans, which aren't reached by
`plan.collect` (they live inside expressions, not as children).
##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanIntegritySuite.scala:
##########
@@ -50,4 +51,29 @@ class LogicalPlanIntegritySuite extends PlanTest {
assert(checkIfSameExprIdNotReused(t.select(Alias(a + b, "ab")(exprId =
a.exprId))).isDefined)
assert(checkIfSameExprIdNotReused(t.select(Alias(a + b, "ab")(exprId =
b.exprId))).isDefined)
}
+
+ test("Checks if an attribute is referenced as non-nullable over a nullable
child") {
Review Comment:
Good start, but this only exercises the simplest shape (one-level `Project`
over a `LocalRelation`, offending node at the root). Could we extend coverage
to the cases that exercise the actual logic and the motivating bug?
- **Multi-child join shape** — an `Aggregate`/operator over a (synthetic
`LEFT OUTER`) `Join` referencing the null-widened side as non-nullable. This is
the SPARK-56395 scenario and the only one that exercises `groupBy(_.exprId)`
across multiple children.
- **Offending reference nested in a compound expression** (e.g. `narrowedRef
+ 1L`, or wrapped in an `Alias`/function) to confirm the inner `_.collect`
finds it, not just bare top-level references.
- **Offending node below the root**, to confirm the `plan.collect` traversal
rather than root-only inspection.
- **`ExprId` not produced by any child, config ON**
(dangling/outer-reference exprId declared non-nullable) → asserts it's ignored.
Currently the "ignored" path is only covered implicitly via the config-disabled
assertion.
- **Duplicate `ExprId` across children, "any nullable wins"** — the behavior
the inline comment specifically claims, currently untested.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##########
@@ -474,6 +513,7 @@ object LogicalPlanIntegrity {
.orElse(LogicalPlanIntegrity.validateSchemaOutput(previousPlan,
currentPlan))
.orElse(LogicalPlanIntegrity.validateNoDanglingReferences(currentPlan))
.orElse(LogicalPlanIntegrity.validateAggregateExpressions(currentPlan))
+ .orElse(LogicalPlanIntegrity.validateNullability(currentPlan))
Review Comment:
Two things while we're touching this chain:
1. The `validateOptimizedPlan` scaladoc bullet list (just above this method)
is now out of date — it lists "is still resolved / special expressions / unique
IDs / same schema / no dangling references" but omits both
`validateAggregateExpressions` (pre-existing) and the new
`validateNullability`. Could we add a bullet, e.g. `- references each child
output attribute with consistent (or wider) nullability`, and one for the
aggregate check too?
2. Confirming intended scope: this runs only from `validateOptimizedPlan`,
so nullability defects introduced during analysis (not optimization) aren't
caught. Assuming that's deliberate — a one-liner in the config doc would make
it clear.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]