MaxGekk commented on code in PR #56502:
URL: https://github.com/apache/spark/pull/56502#discussion_r3413956079


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##########
@@ -436,6 +437,44 @@ object LogicalPlanIntegrity {
     }
   }
 
+  /**
+   * This method validates that no operator references one of its child's 
output attributes with a
+   * narrower nullability than the child produces -- declaring the attribute 
non-nullable while the
+   * child produces it as nullable. Such a mismatch (for example, an 
outer-join-widened column that
+   * a synthesized operator references with its original non-nullable 
metadata) is a latent
+   * plan-integrity defect: the rows are unchanged, so result-equivalence 
reasoning and
+   * `validateSchemaOutput` (which compares types `asNullable`) both miss it, 
yet a later
+   * optimization that trusts the non-nullable declaration may drop a null 
check and produce wrong
+   * results. Declaring a reference *wider* (nullable over a non-nullable 
child) is safe and is not
+   * flagged. References whose `ExprId` is not produced by any child (e.g. 
outer references) are
+   * ignored.
+   *
+   * The check is disabled unless `spark.sql.planChangeValidation.nullability` 
is set, because not
+   * all rules maintain precise nullability today; it is intended to be 
enabled by targeted suites.
+   * Returns the error message if the check does not pass, or None otherwise.
+   */
+  def validateNullability(plan: LogicalPlan): Option[String] = {
+    if (!SQLConf.get.getConf(SQLConf.PLAN_CHANGE_VALIDATION_NULLABILITY)) {
+      return None
+    }
+    plan.collect {
+      case p if p.childrenResolved =>
+        val childNullable = p.children.flatMap(_.output).groupBy(_.exprId).map 
{

Review Comment:
   Minor: worth a one-line note that this is a strictly local 
parent-vs-immediate-child check over `p.expressions`. A narrowing introduced at 
a pure pass-through boundary (an operator that narrows an attribute it only 
forwards via `output`, without re-referencing it in `expressions`) won't be 
flagged at that node, and the parent then treats the already-narrowed value as 
the child's truth. The targeted SPARK-56395 shape is caught because the 
synthesized references appear in `expressions`, but making the scoping explicit 
helps future readers.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##########
@@ -436,6 +437,44 @@ object LogicalPlanIntegrity {
     }
   }
 
+  /**
+   * This method validates that no operator references one of its child's 
output attributes with a

Review Comment:
   Could we make the scope limitations explicit in this scaladoc? The check 
only compares **top-level** `AttributeReference.nullable` — nested nullability 
(struct fields, array elements, map values) is not validated, so a child 
producing `struct<a: long /* nullable */>` referenced as `struct<a: long /* 
non-null */>` keeps the same top-level `nullable` and passes. This mirrors 
`validateSchemaOutput` (which compares `asNullable`), so it's consistent, but 
since that's exactly the "latent nullability defect" class this check targets, 
it's worth calling out. Same for subquery plans, which aren't reached by 
`plan.collect` (they live inside expressions, not as children).



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanIntegritySuite.scala:
##########
@@ -50,4 +51,29 @@ class LogicalPlanIntegritySuite extends PlanTest {
     assert(checkIfSameExprIdNotReused(t.select(Alias(a + b, "ab")(exprId = 
a.exprId))).isDefined)
     assert(checkIfSameExprIdNotReused(t.select(Alias(a + b, "ab")(exprId = 
b.exprId))).isDefined)
   }
+
+  test("Checks if an attribute is referenced as non-nullable over a nullable 
child") {

Review Comment:
   Good start, but this only exercises the simplest shape (one-level `Project` 
over a `LocalRelation`, offending node at the root). Could we extend coverage 
to the cases that exercise the actual logic and the motivating bug?
   
   - **Multi-child join shape** — an `Aggregate`/operator over a (synthetic 
`LEFT OUTER`) `Join` referencing the null-widened side as non-nullable. This is 
the SPARK-56395 scenario and the only one that exercises `groupBy(_.exprId)` 
across multiple children.
   - **Offending reference nested in a compound expression** (e.g. `narrowedRef 
+ 1L`, or wrapped in an `Alias`/function) to confirm the inner `_.collect` 
finds it, not just bare top-level references.
   - **Offending node below the root**, to confirm the `plan.collect` traversal 
rather than root-only inspection.
   - **`ExprId` not produced by any child, config ON** 
(dangling/outer-reference exprId declared non-nullable) → asserts it's ignored. 
Currently the "ignored" path is only covered implicitly via the config-disabled 
assertion.
   - **Duplicate `ExprId` across children, "any nullable wins"** — the behavior 
the inline comment specifically claims, currently untested.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##########
@@ -474,6 +513,7 @@ object LogicalPlanIntegrity {
       .orElse(LogicalPlanIntegrity.validateSchemaOutput(previousPlan, 
currentPlan))
       .orElse(LogicalPlanIntegrity.validateNoDanglingReferences(currentPlan))
       .orElse(LogicalPlanIntegrity.validateAggregateExpressions(currentPlan))
+      .orElse(LogicalPlanIntegrity.validateNullability(currentPlan))

Review Comment:
   Two things while we're touching this chain:
   
   1. The `validateOptimizedPlan` scaladoc bullet list (just above this method) 
is now out of date — it lists "is still resolved / special expressions / unique 
IDs / same schema / no dangling references" but omits both 
`validateAggregateExpressions` (pre-existing) and the new 
`validateNullability`. Could we add a bullet, e.g. `- references each child 
output attribute with consistent (or wider) nullability`, and one for the 
aggregate check too?
   2. Confirming intended scope: this runs only from `validateOptimizedPlan`, 
so nullability defects introduced during analysis (not optimization) aren't 
caught. Assuming that's deliberate — a one-liner in the config doc would make 
it clear.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to