sunchao opened a new pull request, #56227:
URL: https://github.com/apache/spark/pull/56227

### Why are the changes needed?

[SPARK-57176](https://issues.apache.org/jira/browse/SPARK-57176) follows [SPARK-57022](https://issues.apache.org/jira/browse/SPARK-57022), which added nested column pruning for `transform` over `array<struct>` inputs. Array-returning functions still retain the complete input element struct even when downstream expressions and lambdas only require a subset of nested fields. For example:

```sql
SELECT filter(friends, friend -> friend.last = 'Smith').first
FROM contacts
```

If `friends` contains `first`, `middle`, and `last`, Spark currently reads all three fields even though the query only requires `first` and `last`.

### What changes were proposed in this PR?

- Merge downstream result-field requirements with lambda requirements for `filter` and comparator-based `array_sort`.
- Propagate projected element schemas through `reverse`, `shuffle`, `slice`, and `array_compact`.
- Rewrite bound lambda variable types and nested field ordinals after pruning.
- Retain the complete element schema when the whole result is used, when a lambda consumes the whole element, or when default `array_sort` natural ordering requires the full struct.

Functions that inspect full element equality or natural ordering remain out of scope because dropping nested fields could change results.

### Does this PR introduce _any_ user-facing change?

Yes. Eligible queries using array-returning functions over arrays of structs can read a narrower input schema. Query results and SQL APIs are unchanged.

### How was this patch tested?

- `JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.expressions.SchemaPruningSuite" "sql/testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV1SchemaPruningSuite org.apache.spark.sql.execution.datasources.orc.OrcV2SchemaPruningSuite -- -z Array"`
- `JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk@17/bin:$PATH build/sbt catalyst/scalastyle sql/scalastyle`
- `git diff --check`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex (GPT-5)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
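
For readers who want a concrete picture of the rules listed under "What changes were proposed in this PR?", here is a minimal Python sketch of the field-requirement merge. It is a toy model, not the Catalyst implementation: the helper names `pruned_element_fields` and `remap_ordinals`, the function-group sets, and the `array_sort_with_comparator` label are all hypothetical and exist only for illustration. `None` stands for "the whole element is consumed".

```python
from typing import Dict, List, Optional, Set

# Hypothetical groupings that mirror the PR text; not Spark identifiers.
LAMBDA_FUNCS = {"filter", "array_sort_with_comparator"}
PASS_THROUGH_FUNCS = {"reverse", "shuffle", "slice", "array_compact"}


def pruned_element_fields(
    func: str,
    element_fields: List[str],
    downstream: Optional[Set[str]],
    lambda_fields: Optional[Set[str]] = None,
) -> List[str]:
    """Return the element struct fields to read, in their original order.

    downstream: fields selected from the function result, or None when the
        whole result is used.
    lambda_fields: fields the lambda touches, or None when the lambda
        consumes the whole element (ignored for functions without a lambda).
    """
    full = list(element_fields)
    if downstream is None:
        return full  # whole result used: nothing can be dropped
    if func in LAMBDA_FUNCS:
        if lambda_fields is None:
            return full  # lambda needs the complete struct
        required = downstream | lambda_fields  # merge both requirement sets
    elif func in PASS_THROUGH_FUNCS:
        required = set(downstream)  # projection passes straight through
    else:
        return full  # out of scope, e.g. default natural-order array_sort
    return [name for name in element_fields if name in required]


def remap_ordinals(element_fields: List[str], kept: List[str]) -> Dict[int, int]:
    """Map each surviving field's old ordinal to its ordinal after pruning."""
    new_index = {name: i for i, name in enumerate(kept)}
    return {
        old: new_index[name]
        for old, name in enumerate(element_fields)
        if name in new_index
    }


# The `filter(friends, friend -> friend.last = 'Smith').first` example:
fields = ["first", "middle", "last"]
kept = pruned_element_fields("filter", fields, {"first"}, {"last"})
print(kept)  # ['first', 'last']
print(remap_ordinals(fields, kept))  # {0: 0, 2: 1}
```

In this model `last` moves from ordinal 2 to ordinal 1 once `middle` is dropped, which is why field accesses bound inside the lambda would need their ordinals rewritten after pruning.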
