ZitingShen opened a new pull request, #56536:
URL: https://github.com/apache/spark/pull/56536

   ### What changes were proposed in this pull request?
   
   - Make `ColumnarToRowExec` expose nullable row output when materializing a 
columnar batch, because the batch null bitmap is the execution-time source of 
truth.
   - Rebind parent input attributes after inserting row/columnar transitions so 
downstream row operators keep the execution-time nullability instead of stale 
planned attributes.
   - Add a regression covering whole-stage codegen, split consume, row 
boundaries, partial hash aggregate, generate, and take-ordered/project paths 
when a non-nullable columnar output physically contains a null.
   
   ### Why are the changes needed?
   
   `ColumnarToRowExec` can materialize a real null from a columnar batch even 
when the planned output attribute is non-nullable. Downstream row codegen can 
then trust the stale non-nullable attribute, skip the null check, and fail 
while materializing or consuming the row.
   
   One observed failing plan had this shape:
   
   ```text
   HashAggregate
     HashAggregate
       ...
         ColumnarToRow
           Scan parquet ...
   ```
   
   In that case the Parquet reader produced a physical null for a column whose 
planned output attribute was non-nullable, and downstream generated row code 
failed with a `UTF8String.getBaseObject()` `NullPointerException`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Queries that read a physical null through `ColumnarToRowExec` despite 
stale non-nullable planned metadata now preserve the null through row 
materialization instead of crashing in downstream row codegen.
   
   ### How was this patch tested?
   
   - Added `SparkPlanSuite` regression coverage for null materialization 
through `ColumnarToRowExec` and representative downstream row consumers.
   - Ran `git diff --check`.
   - Attempted `build/sbt 'sql/testOnly 
org.apache.spark.sql.execution.SparkPlanSuite -- -z "ColumnarToRowExec should 
materialize null values from non-nullable columnar output"'`, but the local 
checkout only has Java `25.0.2` and current Spark requires Java `25.0.3` or 
later before test execution.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Codex


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to