Hi all,
I would like to discuss extending nested-field pruning through explode /
posexplode,
beyond what NestedColumnAliasing covers today, and to get input on the
right shape
of the change before going deeper into review.
Background
----------
NestedColumnAliasing (SPARK-27707, #24637 from viirya / dongjoon-hyun in
2019, with
the SPARK-29721 follow-up #26978) prunes nested struct fields through
Generate. It
works well for the common single-explode-with-projection case. On wide
nested schemas
it still leaves significant I/O on the table for two shapes:
1. Chained / repeated explode over deeply nested arrays
(e.g. array<struct<array<struct<...>>>>).
2. posexplode where only the position is consumed and the struct fields
are not.
In both cases, the Parquet / ORC reader still materialises the full struct,
and on
schemas with hundreds of fields under array<struct> this can be the
dominant cost
of the query.
Proposal
--------
Add a Catalyst optimizer rule that handles these cases at scan time,
alongside (not
replacing) NestedColumnAliasing. The new rule fires only when
NestedColumnAliasing
cannot apply, so for the existing covered cases the plan is bit-identical.
A supporting expression, NestedArraysZip, rebuilds pruned nested arrays at
arbitrary
depth when the post-explode plan still requires array shape.
The rule is on by default and gated by
spark.sql.optimizer.pruneNestedFieldsThroughGenerateForScan.enabled
so it can be turned off without redeploying.
Measurements
------------
I have run two representative workloads:
* Production-scale: Taboola's ~1,200-column pageview model, with deeply
nested array<struct> columns. Queries that project a handful of
post-explode
fields read ~20-30 GB per task with the rule enabled versus ~200+ GB
with
the rule disabled. This matches a hand-authored "thin-schema" read of
the
same model that we have been using as a baseline, so the rule reaches
the
practical I/O floor for this workload.
* Small / mid-width: a synthetic schema with ~20 top-level fields and one
array<struct<...20 inner fields...>>. Queries that consume only 2-3
inner
fields show ~30% reduction in scan I/O.
Result rows and aggregates are bit-equal to runs with the new rule disabled.
Patch
-----
A working implementation exists: https://github.com/apache/spark/pull/54283
It is large (~7.3k LoC including tests) because it covers the
chained-explode,
position-only-posexplode, and plan-rebuild cases together. I think it is
the right
thing to do to split it into a chain of smaller PRs:
1. Shared infrastructure (SelectedField, ProjectionOverSchema,
SchemaPruning).
2. NestedArraysZip expression and its tests.
3. Single-level explode pruning rule (the common case the existing rule
misses).
4. Chained / posexplode cases layered on top of (3).
Each would be independently reviewable, revertible, and config-gated.
Questions for the list
----------------------
1. Right shape - separate rule, or extension of NestedColumnAliasing?
I went with a separate rule to keep the existing rule's surface
unchanged and to
make the new behaviour easy to disable. I am open to instead folding the
new
cases into NestedColumnAliasing if reviewers prefer one rule with
broader scope.
2. SPIP? The user-facing surface is one new config flag and no API change,
but the
rule itself is non-trivial. I am happy to write a SPIP if the consensus
is that
one is warranted before code review begins.
3. Is anyone willing to shepherd / be a first reviewer on PR-1 (the
infrastructure
split) once it is up?
Disclosure
----------
The initial draft of the patch was authored with help from Claude Opus 4.5,
using
the existing NestedColumnAliasing rule and the SPARK-47230 / SPARK-29721
test cases
as anchors. After the draft I:
* read every changed file by hand and stepped through the optimizer rule
logic
against the existing Catalyst code;
* ran the new and the existing schema-pruning test suites locally;
* ran the rule against representative queries on the pageview model
described
above and manually compared results against runs with the rule disabled.
All tests, benchmark code, and the workload validation above were authored
and
driven by me. I am calling this out explicitly given recent list discussion
on
AI-assisted contributions.
Thanks for your time,
Igor Berman