Chao Sun created SPARK-57175:
--------------------------------

             Summary: Extend nested column pruning to exists and forall over 
arrays of structs
                 Key: SPARK-57175
                 URL: https://issues.apache.org/jira/browse/SPARK-57175
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Chao Sun


SPARK-57022 added nested column pruning for transform over array<struct> 
inputs. The same optimization does not yet apply to the exists and forall 
higher-order array functions.

For example:

{code:sql}
SELECT exists(rule_results, rule -> rule.rule_version > 10)
FROM events
{code}

If the elements of rule_results contain additional fields, Spark currently 
retains the full element struct in the scan schema even though the predicate 
only reads rule_version. As a result, Parquet and ORC scans read nested columns 
that the query never uses, which is costly for wide array element schemas.
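
To make the gap concrete, here is a small sketch of the scan schema Spark 
requests today versus the one this improvement aims for. It is plain Python, 
not Spark code, and every field except rule_version is a hypothetical name 
invented for illustration.

```python
# Hypothetical element schema for events.rule_results. Only rule_version
# appears in the issue; the other fields are assumptions for illustration.
FULL_ELEMENT = [
    ("rule_id", "string"),
    ("rule_version", "int"),
    ("rule_name", "string"),
    ("details", "string"),
]


def array_of_struct_ddl(fields):
    """Render a list of (name, type) pairs as an array<struct<...>> string."""
    inner = ",".join(f"{name}:{dtype}" for name, dtype in fields)
    return f"array<struct<{inner}>>"


def prune(fields, used):
    """Keep only the fields named in `used`, preserving declaration order."""
    return [(name, dtype) for name, dtype in fields if name in used]


# What the scan requests today: the full element struct.
current_scan = array_of_struct_ddl(FULL_ELEMENT)
# What the predicate `rule.rule_version > 10` actually needs.
desired_scan = array_of_struct_ddl(prune(FULL_ELEMENT, {"rule_version"}))

print(current_scan)
print(desired_scan)
```

With a pruned scan schema, a columnar reader only has to materialize the 
rule_version leaf column of the array.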

The optimization can be extended safely to exists and forall because both 
consume array elements to produce a boolean result; neither returns the 
original elements. The implementation should reuse the two-stage approach 
introduced by SPARK-57022, together with its conservative fallback:

* SchemaPruning identifies statically known GetStructField chains rooted at the 
element lambda variable and propagates a narrower array element schema to the 
scan.
* ProjectionOverSchema rewrites the bound lambda variable type and nested field 
ordinals after pruning.
* If the lambda consumes the whole element, Spark conservatively retains the 
complete element schema.
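
The steps above can be sketched with a toy expression tree. This is a 
simplified model in plain Python, not the Catalyst implementation: the class 
and function names (LambdaVar, GetField, needed_fields, rewrite_ordinals) are 
hypothetical, and only single-level field access on the lambda variable is 
pruned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LambdaVar:
    name: str


@dataclass(frozen=True)
class GetField:
    child: object
    ordinal: int
    name: str


@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object


@dataclass(frozen=True)
class Literal:
    value: object


WHOLE = None  # marker: the lambda body consumes the whole element


def needed_fields(expr, var):
    """Stage 1: collect field names read through `var`, or WHOLE."""
    if expr == var:
        return WHOLE  # bare use of the element, so nothing can be pruned
    if isinstance(expr, GetField):
        if expr.child == var:
            return {expr.name}
        return needed_fields(expr.child, var)
    if isinstance(expr, GreaterThan):
        left = needed_fields(expr.left, var)
        right = needed_fields(expr.right, var)
        if left is WHOLE or right is WHOLE:
            return WHOLE
        return left | right
    return set()  # literals and unrelated leaves read no element fields


def prune_schema(fields, needed):
    """Narrow the element schema; fall back to the full schema on WHOLE."""
    if needed is WHOLE:
        return list(fields)
    return [f for f in fields if f in needed]


def rewrite_ordinals(expr, var, pruned):
    """Stage 2: re-resolve field ordinals against the pruned schema."""
    if isinstance(expr, GetField):
        if expr.child == var:
            return GetField(var, pruned.index(expr.name), expr.name)
        child = rewrite_ordinals(expr.child, var, pruned)
        return GetField(child, expr.ordinal, expr.name)
    if isinstance(expr, GreaterThan):
        return GreaterThan(
            rewrite_ordinals(expr.left, var, pruned),
            rewrite_ordinals(expr.right, var, pruned),
        )
    return expr


full = ["rule_id", "rule_version", "rule_name"]  # hypothetical field names
rule = LambdaVar("rule")
pred = GreaterThan(GetField(rule, 1, "rule_version"), Literal(10))

pruned = prune_schema(full, needed_fields(pred, rule))
rewritten = rewrite_ordinals(pred, rule, pruned)

# A predicate that uses the element itself triggers the fallback.
whole_pred = GreaterThan(rule, Literal(10))
fallback = prune_schema(full, needed_fields(whole_pred, rule))
```

In this model rule_version moves from ordinal 1 to ordinal 0 once the other 
fields are dropped, which is why the ordinal rewrite has to follow the schema 
change.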

ArrayFilter and ArraySort are intentionally out of scope because they return 
original input elements and therefore require a different downstream-schema 
design.
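
A small illustration of that distinction, using plain Python lists of dicts 
in place of array<struct> values (the field names other than rule_version are 
hypothetical): boolean-valued functions give the same answer on pruned 
elements, while a filter would hand truncated elements to downstream 
operators.

```python
# Toy data standing in for one rule_results array; rule_id and rule_name
# are invented fields, only rule_version comes from the issue.
full_rows = [
    {"rule_id": "a", "rule_version": 7, "rule_name": "x"},
    {"rule_id": "b", "rule_version": 12, "rule_name": "y"},
]
# The same array after pruning the element struct down to rule_version.
pruned_rows = [{"rule_version": r["rule_version"]} for r in full_rows]


def pred(rule):
    return rule["rule_version"] > 10


# exists / forall analogues: the result is a boolean, so pruning is invisible.
exists_full = any(pred(r) for r in full_rows)
exists_pruned = any(pred(r) for r in pruned_rows)
forall_full = all(pred(r) for r in full_rows)
forall_pruned = all(pred(r) for r in pruned_rows)

# filter analogue: the result contains the elements themselves, so pruning
# changes what downstream operators can see.
filter_full = [r for r in full_rows if pred(r)]
filter_pruned = [r for r in pruned_rows if pred(r)]

print(exists_full == exists_pruned, forall_full == forall_pruned)
print(filter_full == filter_pruned)
```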



--
This message was sent by Atlassian Jira
(v8.20.10#820010)