[ https://issues.apache.org/jira/browse/SPARK-57175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-57175:
-----------------------------------
Labels: pull-request-available (was: )
> Extend nested column pruning to exists and forall over arrays of structs
> ------------------------------------------------------------------------
>
> Key: SPARK-57175
> URL: https://issues.apache.org/jira/browse/SPARK-57175
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Chao Sun
> Priority: Major
> Labels: pull-request-available
>
> SPARK-57022 added nested column pruning for transform over array<struct>
> inputs. The same optimization does not yet apply to the exists and forall
> higher-order array functions.
> For example:
> {code:sql}
> SELECT exists(rule_results, rule -> rule.rule_version > 10)
> FROM events
> {code}
> If the element struct of rule_results contains additional fields, Spark
> currently retains the full element struct in the scan schema even though the
> predicate only reads rule_version. For wide element schemas, this makes
> Parquet and ORC scans read columns that the query never uses.
> The optimization can be extended safely to exists and forall because both
> consume array elements only to produce a boolean result; neither returns the
> original elements. The implementation should reuse the two-stage approach
> introduced by SPARK-57022, plus a conservative fallback:
> * SchemaPruning identifies statically known GetStructField chains rooted at
> the element lambda variable and propagates a narrower array element schema to
> the scan.
> * ProjectionOverSchema rewrites the bound lambda variable type and nested
> field ordinals after pruning.
> * If the lambda consumes the whole element, Spark conservatively retains the
> complete element schema.
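
The three bullets above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: the classes Var, GetField, GreaterThan and Lit, the helpers required_paths, prune and rewrite, and the element fields rule_id and details are all hypothetical names made up for the illustration. Only rule_version comes from the example query.

```python
from dataclasses import dataclass

# Toy expression nodes. These are stand-ins for illustration, not Spark classes.
@dataclass(frozen=True)
class Var:            # the bound element lambda variable
    name: str

@dataclass(frozen=True)
class GetField:       # models a struct field access by ordinal
    child: object
    ordinal: int
    name: str

@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object

@dataclass(frozen=True)
class Lit:
    value: int

def chain(expr):
    """Field-name path if expr is a GetField chain rooted at a Var, else None."""
    names = []
    while isinstance(expr, GetField):
        names.append(expr.name)
        expr = expr.child
    return tuple(reversed(names)) if isinstance(expr, Var) else None

def required_paths(expr):
    """Paths read from the element, or None if the whole element is consumed."""
    if isinstance(expr, Var):
        return None  # bare element: fall back to the full schema
    if isinstance(expr, GetField):
        path = chain(expr)
        return {path} if path is not None else required_paths(expr.child)
    if isinstance(expr, GreaterThan):
        left, right = required_paths(expr.left), required_paths(expr.right)
        return None if left is None or right is None else left | right
    return set()  # literals read nothing

def prune(schema, paths):
    """Stage 1: keep only the requested fields, preserving field order."""
    if paths is None:
        return schema
    out = {}
    for name, typ in schema.items():
        tails = {p[1:] for p in paths if p and p[0] == name}
        if not tails:
            continue
        # An empty tail means the whole field is read, so keep it unpruned.
        whole = () in tails or not isinstance(typ, dict)
        out[name] = typ if whole else prune(typ, tails)
    return out

def rewrite(expr, elem_schema):
    """Stage 2: recompute ordinals against the pruned schema; returns (expr, type)."""
    if isinstance(expr, Var):
        return expr, elem_schema
    if isinstance(expr, GetField):
        child, child_type = rewrite(expr.child, elem_schema)
        ordinal = list(child_type).index(expr.name)
        return GetField(child, ordinal, expr.name), child_type[expr.name]
    if isinstance(expr, GreaterThan):
        left, _ = rewrite(expr.left, elem_schema)
        right, _ = rewrite(expr.right, elem_schema)
        return GreaterThan(left, right), "boolean"
    return expr, "int"

# Hypothetical element schema; structs are dicts, leaf types are strings.
ELEMENT_SCHEMA = {
    "rule_id": "string",
    "rule_version": "int",
    "details": {"msg": "string", "code": "int"},
}

# Lambda body of: rule -> rule.rule_version > 10
body = GreaterThan(GetField(Var("rule"), 1, "rule_version"), Lit(10))

paths = required_paths(body)
pruned = prune(ELEMENT_SCHEMA, paths)
rewritten, _ = rewrite(body, pruned)
```

In this model, rule_version sits at ordinal 1 in the full element schema and at ordinal 0 once rule_id and details are pruned away, which is why the second stage has to rebind ordinals. A lambda body that is just the bare variable makes required_paths return None, and prune then hands back the full schema unchanged.
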
> ArrayFilter and ArraySort are intentionally out of scope because they return
> original input elements and therefore require a different downstream-schema
> design.
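
The scoping argument can be checked on plain Python data. This is a sketch under simplifying assumptions: elements are modeled as dicts, rule_id is a hypothetical extra field, and SQL null semantics are ignored.

```python
# Elements as read with the full schema (rule_id is a hypothetical extra field).
full_elements = [
    {"rule_id": "a", "rule_version": 7},
    {"rule_id": "b", "rule_version": 12},
]

# The same elements as read with the pruned schema.
pruned_elements = [{"rule_version": e["rule_version"]} for e in full_elements]

def predicate(rule):
    # Models the lambda body: rule -> rule.rule_version > 10
    return rule["rule_version"] > 10

# exists and forall only return a boolean, so pruning cannot change the result.
exists_full = any(predicate(e) for e in full_elements)
exists_pruned = any(predicate(e) for e in pruned_elements)
forall_full = all(predicate(e) for e in full_elements)
forall_pruned = all(predicate(e) for e in pruned_elements)

# A filter returns the elements themselves, so pruning changes its output.
filter_full = [e for e in full_elements if predicate(e)]
filter_pruned = [e for e in pruned_elements if predicate(e)]
```

Here exists and forall agree on both inputs, while filter_pruned has lost rule_id. A downstream reader of the filtered array could still need that field, so pruning there would require tracking what consumers of the result read.
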