cloud-fan commented on code in PR #56237:
URL: https://github.com/apache/spark/pull/56237#discussion_r3338533796


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala:
##########
@@ -396,7 +396,13 @@ package object expressions  {
           // Then this will add ExtractValue("c", ExtractValue("b", a)), and 
alias the final
           // expression as "c".
           val fieldExprs = nestedFields.foldLeft(a: Expression) { (e, name) =>
-            ExtractValue(e, Literal(name), resolver)
+            // Extracting a field from a NULL (NullType) base yields NULL (SQL 
NULL propagation),
+            // rather than throwing INVALID_EXTRACT_BASE_FIELD_TYPE. A 
NullType column can arise
+            // e.g. from schema evolution with missing columns. This is scoped 
to dotted
+            // multipart field access (`col.a`); `col['key']`/`col[0]` go 
through
+            // UnresolvedExtractValue and still throw.
+            if (e.dataType == NullType) Literal(null, NullType)

Review Comment:
   This fold is reached by the **legacy** analyzer, but the single-pass 
`Resolver` takes a different route to the same method and never reaches this 
branch, so the fix is legacy-only.
   
   - Legacy: `AttributeSeq.getCandidatesForResolution` (package.scala:361-377) 
does **not** filter candidates by `isExtractable`, so the NullType candidate 
reaches this fold and returns `Literal(null)`.
   - Single-pass: `NameScope.getCandidatesForResolution` 
(NameScope.scala:759-766) filters candidates through 
`ExtractValue.isExtractable` **before** calling `resolveCandidates` 
(NameScope.scala:729). `isExtractable(NullType, ["a"])` is `false` (the 
`extractValue(NullType, _)` arm returns the error, 
complexTypeExtractors.scala:98), so the candidate is dropped, 
`resolveCandidates` receives an empty set, and the name fails to resolve — the 
single-pass analyzer **throws**.
   
   Net effect: `SELECT col.a FROM (SELECT null AS col)` returns a NULL row 
under the legacy analyzer (default) but throws under 
`spark.sql.analyzer.singlePassResolver.enabled`, and under dual-run sampling 
`HybridAnalyzer` (HybridAnalyzer.scala:43-51) throws the single-pass failure 
even though the legacy run succeeded. Round 1's `isExtractable`-changing 
approach didn't have this divergence — both analyzers would have returned NULL.
   
   Could you apply the fix consistently to both analyzers (e.g. make 
`isExtractable` treat a fully-NullType extraction chain as extractable so the 
single-pass path reaches the same NULL result, or handle NullType in the 
single-pass multipart path), or — if legacy-only is intended — justify it and 
add single-pass coverage that locks in the behavior?



##########
sql/core/src/test/resources/sql-tests/inputs/extract-value-resolution-edge-cases.sql:
##########
@@ -8,3 +8,12 @@ SELECT col1.a, a FROM t1 ORDER BY col1.a;
 SELECT split(col1, '-')[1] AS a FROM VALUES('a-b') ORDER BY split(col1, 
'-')[1];
 
 DROP TABLE t1;
+
+-- SPARK-57186: dotted multipart field access on a NullType base returns NULL 
instead of throwing
+-- INVALID_EXTRACT_BASE_FIELD_TYPE (NULL propagation; a NullType column can 
arise e.g. from schema
+-- evolution with missing columns). The fix is scoped to multipart name 
resolution (`col.a`); the
+-- `col['key']` and `col[0]` forms go through UnresolvedExtractValue and still 
throw, so the
+-- shared ExtractValue utility and the single-pass resolver are unaffected.
+SELECT col.a FROM (SELECT null AS col) t;
+SELECT col[0] FROM (SELECT null AS col) t;

Review Comment:
   Non-blocking: for a NullType base, `col.a` now returns NULL while the 
equivalent subscript forms `col[0]`/`col['key']` still throw 
`INVALID_EXTRACT_BASE_FIELD_TYPE`. `col.a` and `col['a']` are equivalent 
struct-field-access syntaxes elsewhere in Spark, so this asymmetry is worth 
confirming as intentional. It's largely tied to the analyzer-divergence finding 
above — a consistent cross-analyzer fix is the natural point to decide whether 
the subscript forms should follow `col.a` here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to