sunchao opened a new pull request, #56070:
URL: https://github.com/apache/spark/pull/56070
### Why are the changes needed?
Spark can prune nested struct fields referenced directly by a query, but it
does not currently prune nested fields read through the lambda variable of
`transform` over an `array<struct>` column.
For example:
```sql
SELECT transform(rule_results, rule ->
named_struct(
'rule_public_id', rule.rule_public_id,
'rule_version', rule.rule_version))
FROM events
```
If `rule_results` contains additional fields, Spark currently retains the
full element struct in the scan schema even though only two nested fields are
required. This causes unnecessary Parquet and ORC input reads for wide array
element schemas.
This change addresses
[SPARK-57022](https://issues.apache.org/jira/browse/SPARK-57022).
### What changes were proposed in this pull request?
- Recognize statically identifiable nested field reads through the element
variable of `ArrayTransform`.
- Build a projected array element schema from exactly those referenced
fields and propagate it to the scan input.
- Rewrite the bound lambda variable type and `GetStructField` ordinals
against the projected element schema after pruning.
- Fall back to retaining the full element schema when the lambda consumes
the complete element, so pruning is applied only when it is safe.
- Add Catalyst and datasource tests covering ordinal rewrites, deep nesting,
nested input paths with null values, indexed lambdas, case-insensitive
resolution, and conservative fallback.
The implementation intentionally has two stages. `SchemaPruning` discovers
which fields the lambda needs from the array element. `ProjectionOverSchema`
then rewrites the lambda against the narrower element type because pruning can
change field ordinals. For example, pruning `struct<a, b, c>` to `struct<a, c>`
moves `c` from ordinal `2` to ordinal `1`.
### Does this PR introduce _any_ user-facing change?
Yes. Eligible queries using `transform` over arrays of structs can read a
narrower input schema. Query results and SQL APIs are unchanged.
### How was this patch tested?
- `build/sbt "catalyst/testOnly
org.apache.spark.sql.catalyst.expressions.SchemaPruningSuite"`
- `build/sbt "sql/testOnly
org.apache.spark.sql.execution.datasources.parquet.ParquetV1SchemaPruningSuite
org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite
org.apache.spark.sql.execution.datasources.orc.OrcV1SchemaPruningSuite
org.apache.spark.sql.execution.datasources.orc.OrcV2SchemaPruningSuite -- -z
ArrayTransform"`
- `git diff --check apache/master...HEAD`
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex (GPT-5)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]