[PR] [SPARK-57176][SQL][4.x] Extend nested column pruning through array-returning functions [spark]

via GitHub Fri, 05 Jun 2026 10:19:30 -0700


sunchao opened a new pull request, #56345:
URL: https://github.com/apache/spark/pull/56345


   ### Why are the changes needed?
   
   This is the `branch-4.x` backport of #56227, merged to `master` as
   `042ad7d0c4ac1c4d3e9fdeb48e2695fdeb861135`.
   
   [SPARK-57176](https://issues.apache.org/jira/browse/SPARK-57176) follows
   [SPARK-57022](https://issues.apache.org/jira/browse/SPARK-57022), which added nested column
   pruning for `transform` over `array<struct>` inputs.
   
   Array-returning functions still retain the complete input element struct even when downstream
   expressions and lambdas require only a subset of nested fields. For example:
   
   ```sql
   SELECT filter(friends, friend -> friend.last = 'Smith').first
   FROM contacts
   ```
   
   If `friends` contains `first`, `middle`, and `last`, Spark reads all three fields even though the
   query only requires `first` and `last`.
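
To make the example concrete, here is a small illustrative model of the read schema before and after pruning. It is a sketch and not Spark's implementation: the helper `array_of_struct_ddl` is hypothetical, and the `string` field types are an assumption, since the description only names the fields.

```python
# Illustrative model only; this is not Spark's implementation.
# Assumed element schema for `friends`; the string types are an assumption.
FRIEND_FIELDS = [("first", "string"), ("middle", "string"), ("last", "string")]


def array_of_struct_ddl(fields):
    """Render (name, type) pairs as an array<struct<...>> DDL string."""
    inner = ",".join(f"{name}:{dtype}" for name, dtype in fields)
    return f"array<struct<{inner}>>"


# filter(friends, friend -> friend.last = 'Smith').first
# The lambda reads `last`; the access on the filter result reads `first`.
needed = {"last"} | {"first"}
pruned = [(name, dtype) for name, dtype in FRIEND_FIELDS if name in needed]

print(array_of_struct_ddl(FRIEND_FIELDS))  # schema read without pruning
print(array_of_struct_ddl(pruned))         # schema read with pruning
```

Under these assumptions the second line printed omits `middle`, which is the narrower input schema the PR aims for.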
   
   ### What changes were proposed in this PR?
   
   - Merge downstream result-field requirements with lambda requirements for `filter` and
     comparator-based `array_sort`.
   - Propagate projected element schemas through `reverse`, `shuffle`, `slice`, and `array_compact`.
   - Rewrite bound lambda variable types and nested field ordinals after pruning.
   - Retain the complete element schema when the whole result is used, when a lambda consumes the
     whole element, or when default `array_sort` natural ordering requires the full struct.
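
The rules above can be sketched as a toy model. This is not the Catalyst code from the PR: `prune_element` and `remap_ordinals` are hypothetical names, field sets stand in for Spark's schema types, and `None` is used here as an assumed marker for "the whole element is needed".

```python
from typing import Dict, List, Optional, Set


def prune_element(
    element_fields: List[str],
    lambda_fields: Optional[Set[str]],
    downstream_fields: Optional[Set[str]],
) -> List[str]:
    """Return the element fields to keep, in their original order.

    None for either argument models "the whole element is needed":
    the whole result is used, a lambda consumes the whole element, or
    default array_sort ordering compares full structs.
    """
    if lambda_fields is None or downstream_fields is None:
        return list(element_fields)  # fall back to the complete schema
    required = lambda_fields | downstream_fields  # merge both requirement sets
    return [name for name in element_fields if name in required]


def remap_ordinals(element_fields: List[str], pruned_fields: List[str]) -> Dict[int, int]:
    """Map each surviving field's old ordinal to its ordinal after pruning."""
    return {element_fields.index(name): new for new, name in enumerate(pruned_fields)}


fields = ["first", "middle", "last"]

# filter(friends, friend -> friend.last = 'Smith').first
pruned = prune_element(fields, {"last"}, {"first"})
print(pruned)                          # ['first', 'last']
print(remap_ordinals(fields, pruned))  # {0: 0, 2: 1}

# A pass-through function such as reverse has no lambda, so only
# downstream requirements matter (modelled as an empty lambda set).
print(prune_element(fields, set(), {"first"}))  # ['first']

# A lambda that consumes the whole element forces the full schema.
print(prune_element(fields, None, {"first"}))   # ['first', 'middle', 'last']
```

The ordinal map illustrates why bound field accesses need rewriting: once `middle` is dropped, `last` moves from position 2 to position 1 in the pruned struct.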
   
   The conflict resolution is adapted to the transform-only SPARK-57022 implementation currently on
   `branch-4.x`. It does not include the separate SPARK-57175 optimization for `exists` and `forall`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Eligible queries using array-returning functions over arrays of structs can read a narrower
   input schema. Query results and SQL APIs are unchanged.
   
   ### How was this patch tested?
   
   - `JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.expressions.SchemaPruningSuite" "sql/testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV2SchemaPruningSuite -- -z Array" catalyst/scalastyle sql/scalastyle`
     - Catalyst: 11 tests passed.
     - Parquet/ORC datasource suites: 260 tests passed.
     - Catalyst and SQL scalastyle: no errors or warnings.
   - `git diff --check`
   - ASCII scan of all changed Scala files
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Codex (GPT-5)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

