Chao Sun created SPARK-57022:
--------------------------------
Summary: Support nested column pruning for transform over arrays
of structs
Key: SPARK-57022
URL: https://issues.apache.org/jira/browse/SPARK-57022
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Chao Sun
Spark supports nested column pruning for ordinary nested struct field accesses,
but it does not currently prune nested fields accessed through the lambda
variable of transform over an array<struct> column.
For example:
{code:sql}
SELECT transform(rule_results, rule ->
named_struct(
'rule_public_id', rule.rule_public_id,
'rule_version', rule.rule_version))
FROM events
{code}
If rule_results contains additional fields, Spark currently retains the full
element struct in the scan schema, even though the query only reads
rule_public_id and rule_version. This can increase Parquet and ORC I/O for
datasets containing wide array element structs.
The proposed change extends nested schema pruning to recognize statically
identifiable nested field accesses through the element variable of
ArrayTransform. It builds a projected element schema from the referenced fields
and carries that narrower schema back to the array input.
Because pruning fields changes the physical ordinal layout of the element
struct, the projected expression must also rewrite the bound lambda variable
type and nested GetStructField ordinals against the pruned schema. For example,
if struct<a, b, c> is pruned to struct<a, c>, field c moves from ordinal 2 to 1.
The optimization remains conservative: if the lambda consumes the complete
array element, such as x -> struct(x.a, x), Spark retains the full element
schema instead of pruning it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]