[ 
https://issues.apache.org/jira/browse/SPARK-34956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080149#comment-18080149
 ] 

Jake Waro commented on SPARK-34956:
-----------------------------------

h2. Adding evidence for SPARK-34956 (multi-field nested column pruning on Generator output)

This issue is still affected as of Spark 3.5.3. Below is a minimal reproducer 
with cost data, in case it's useful for prioritization or as a regression-test 
artifact for any future fix.

h2. Why this comment

SPARK-34956 tracks extending nested column pruning on Generator output to the 
multi-field case (the single-field case being handled by its predecessor 
SPARK-34638). I built a self-contained reproducer for our production scenario 
and discovered, while reading {{NestedColumnAliasing.scala}}, that the 
behavior we were seeing is exactly the limitation this JIRA tracks:

{quote}
{{// We only process single field case. For multiple field case, we cannot}}
{{// directly move field extractor into the generator expression.}}
{{// TODO(SPARK-34956): support multiple fields.}}
{quote}

The reproducer demonstrates the limitation cleanly, quantifies the cost on 
synthetic data, and shows the issue persists on Spark 3.5.0–3.5.3.

h2. Minimal reproducer

A self-contained Maven project: [SparkNestedPruningRepro 
gist|https://gist.github.com/jwaro/441abb857175f9c9d69041bd2a4a81da]. Public 
artifacts only (Apache Spark, Apache Iceberg, Scala). No external services or 
proprietary data.

To run:
{code:bash}
mvn compile exec:exec
{code}

Exit code is non-zero when the limitation is present (so the reproducer can 
serve as a regression-test artifact for a fix).

h2. Schema

{code:sql}
CREATE TABLE repro_cat.db.two_level (
  id BIGINT NOT NULL,
  outer ARRAY<STRUCT<
    a: STRING,
    padding_outer: STRING,
    inner: ARRAY<STRUCT<
      x: INT,
      y: INT,
      z: INT,
      padding_a: STRING,
      padding_b: STRING,
      padding_c: STRING,
      padding_d: STRING
    >>
  >>
)
USING iceberg
TBLPROPERTIES ('format-version' = '2')
{code}

The table holds 1 million rows, each with one outer element containing a 
5-element inner array. The {{padding_*}} string columns vary per row to defeat 
Parquet's RLE/dictionary encoding, so the bytesRead difference is observable 
locally.
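
For readers who do not open the gist, data of the shape described above could be produced along these lines. This is a sketch, not the gist's actual generator: the {{TwoLevelData}} object, the {{pad}} helper, and the use of {{sha2}} to make every padding string unique are assumptions made for illustration.

{code:scala}
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object TwoLevelData {
  // Hypothetical helper: hashes its inputs so each value is distinct and
  // dictionary/RLE encoding cannot collapse the column.
  private def pad(tag: String, parts: Column*): Column =
    sha2(concat_ws("-", (lit(tag) +: parts.map(_.cast("string"))): _*), 256)

  // One outer element per row, holding a 5-element inner array.
  def build(spark: SparkSession, rows: Long): DataFrame = {
    val id = col("id")
    val inner = transform(sequence(lit(0), lit(4)), i =>
      struct(
        i.as("x"),
        (i + 1).as("y"),
        (i + 2).as("z"),
        pad("a", id, i).as("padding_a"),
        pad("b", id, i).as("padding_b"),
        pad("c", id, i).as("padding_c"),
        pad("d", id, i).as("padding_d")))
    spark.range(rows).select(
      id,
      array(struct(
        concat(lit("a-"), id.cast("string")).as("a"),
        pad("outer", id).as("padding_outer"),
        inner.as("inner"))).as("outer"))
  }
}

// Assumed usage, with the table from the DDL above already created:
// TwoLevelData.build(spark, 1000000L).writeTo("repro_cat.db.two_level").append()
{code}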

h2. Failing query — multiple fields from Generator output

The query accesses *both* {{inner.x}} and {{inner.y}}, that is, two fields 
from the generator output (the case this JIRA tracks):

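{code:scala}
import org.apache.spark.sql.functions.{arrays_zip, col, explode, sum}

spark.table("repro_cat.db.two_level")
  // First generator: zip outer.a with outer.inner, one row per outer element.
  .select(col("id"), explode(arrays_zip(col("outer.a"), col("outer.inner"))).as("zipped"))
  // Second generator: one row per inner element.
  .select(col("id"), col("zipped.a").as("a"), explode(col("zipped.inner")).as("inner_elem"))
  // Two fields are taken from the generator output: the multi-field case.
  .select(col("id"), col("inner_elem.x"), col("inner_elem.y"))
  .agg(sum("x"), sum("y"))
{code}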

Observed {{BatchScan.readSchema()}} (the only authoritative source for what 
got pushed into the V2 scan; the printed plan-tree {{BatchScan[outer#34]}} 
shows output attributes, not the pruned schema):

{noformat}
root
 |-- outer: array
 |    |-- element: struct
 |    |    |-- a: string
 |    |    |-- inner: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- x: integer
 |    |    |    |    |-- y: integer
 |    |    |    |    |-- z: integer                  <-- not accessed
 |    |    |    |    |-- padding_a: string           <-- not accessed
 |    |    |    |    |-- padding_b: string           <-- not accessed
 |    |    |    |    |-- padding_c: string           <-- not accessed
 |    |    |    |    |-- padding_d: string           <-- not accessed
{noformat}
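
One way to obtain such a schema, shown here as a sketch and not necessarily how the gist does it, is to collect the V2 scan nodes from the optimized logical plan and call {{readSchema()}} on each. The {{ReadSchemas}} object is a name introduced for illustration; {{DataSourceV2ScanRelation}} is an internal Catalyst class, so this may need adjusting across Spark versions.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation
import org.apache.spark.sql.types.StructType

object ReadSchemas {
  // Returns the read schema of every V2 scan in the optimized plan.
  // The result is empty when the query has no DataSource V2 scan.
  def of(df: DataFrame): Seq[StructType] =
    df.queryExecution.optimizedPlan.collect {
      case r: DataSourceV2ScanRelation => r.scan.readSchema()
    }
}

// Assumed usage:
// ReadSchemas.of(failingQuery).foreach(s => println(s.treeString))
{code}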

The scan reads 8 leaf fields, 5 of which the query never references. This 
matches the {{TODO(SPARK-34956)}} branch in 
{{NestedColumnAliasing.GeneratorNestedColumnAliasing}} (the bailout when 
{{nestedFieldsOnGenerator.size > 1}}).
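
The leaf arithmetic can be checked mechanically. The sketch below flattens a schema into dotted leaf paths and lists those the query does not reference. {{SchemaLeaves}} is a hypothetical helper, and the schema is transcribed by hand from the {{readSchema}} output above, not read from a live scan.

{code:scala}
import org.apache.spark.sql.types._

object SchemaLeaves {
  // Flattens a data type into dotted leaf paths; array levels add no segment.
  def paths(dt: DataType, prefix: String = ""): Seq[String] = dt match {
    case s: StructType =>
      s.fields.toSeq.flatMap { f =>
        paths(f.dataType, if (prefix.isEmpty) f.name else s"$prefix.${f.name}")
      }
    case a: ArrayType => paths(a.elementType, prefix)
    case m: MapType =>
      paths(m.keyType, s"$prefix.key") ++ paths(m.valueType, s"$prefix.value")
    case _ => Seq(prefix)
  }

  // Leaves present in the read schema that the query never references.
  def unreferenced(schema: StructType, referenced: Set[String]): Seq[String] =
    paths(schema).filterNot(referenced)

  // The 8-leaf schema observed for the failing query (transcribed by hand).
  val observed: StructType = StructType.fromDDL(
    "`outer` ARRAY<STRUCT<a: STRING, `inner`: ARRAY<STRUCT<x: INT, y: INT, " +
      "z: INT, padding_a: STRING, padding_b: STRING, padding_c: STRING, " +
      "padding_d: STRING>>>>")

  // The leaves the failing query actually touches.
  val referencedByQuery: Set[String] =
    Set("outer.a", "outer.inner.x", "outer.inner.y")
}
{code}

Under these assumptions, {{SchemaLeaves.paths(SchemaLeaves.observed)}} yields 8 paths and {{SchemaLeaves.unreferenced(SchemaLeaves.observed, SchemaLeaves.referencedByQuery)}} yields the 5 flagged above.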

h2. Control case — single field from Generator output

The same table, queried with a single field from the generator output, prunes 
correctly (this is the SPARK-34638 case):

{code:scala}
import org.apache.spark.sql.functions.{col, count, explode}

spark.table("repro_cat.db.two_level")
  // Only outer.a feeds the generator.
  .select(col("id"), explode(col("outer.a")).as("a"))
  .agg(count("a"))
{code}

Observed {{readSchema}}:
{noformat}
root
 |-- outer: array
 |    |-- element: struct
 |    |    |-- a: string
{noformat}

The scan reads 1 leaf. Single-field generator pruning works; multi-field 
pruning does not, exactly as SPARK-34956 describes.

h2. I/O cost

Captured via {{InputMetrics.bytesRead}} on the source-scan stage over 1M 
synthetic rows:

|| Case || Fields from generator output || bytesRead ||
| CONTROL (SPARK-34638 case) | 1 | 1,240,898 bytes |
| FAILING (SPARK-34956 case) | 2 | 33,603,904 bytes |
| Ratio | | *27.1×* |

The failing case reads about 27× as many bytes as the control while taking 
only one more field from the generator output. The wide {{readSchema}} forces 
Iceberg to read every {{padding_*}} field of the inner struct from disk.
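
A figure like this can be collected with a {{SparkListener}} that sums {{inputMetrics.bytesRead}} over finished tasks. The sketch below is one possible approach and not necessarily the gist's: the {{BytesReadListener}} class is a name introduced here, and it totals every task that ends while it is registered instead of isolating the source-scan stage.

{code:scala}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class BytesReadListener extends SparkListener {
  private val total = new AtomicLong(0L)

  // Adds the input bytes reported by each finished task.
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics // can be null for some failed tasks
    if (metrics != null) total.addAndGet(metrics.inputMetrics.bytesRead)
  }

  def bytesRead: Long = total.get()
  def reset(): Unit = total.set(0L)
}

// Assumed usage:
// val listener = new BytesReadListener
// spark.sparkContext.addSparkListener(listener)
// failingQuery.collect()
// Listener events arrive asynchronously, so allow the event queue to drain
// before reading listener.bytesRead.
{code}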

In a production deployment, the same access pattern over a real Iceberg V2 
table produces a ~6× bytesRead inflation and a ~1.7–3.3× wall-time increase on 
individual tests, contributing to a ~2.0× suite-wide runtime gap vs. an 
equivalent flat-table layout.

h2. Version sweep

Confirmed the identical 27.1× bytesRead ratio and 8-leaf {{readSchema}} on 
Spark 3.5.0, 3.5.1, 3.5.2, and 3.5.3 (Iceberg 1.9.2 throughout). This is not a 
patch-release regression; the limitation has been present throughout the 3.5 
series. It may be worth updating the JIRA's "Affects Versions" field, which 
currently lists only 3.2.0.

h2. Pruning-related configs are no-ops (as expected)

The reproducer also reruns the failing query under four configs commonly cited 
as pruning-related:

* {{spark.sql.optimizer.nestedSchemaPruning.enabled=true}}
* {{spark.sql.nestedSchemaPruning.enabled=true}}
* {{spark.sql.optimizer.expression.nestedPruning.enabled=true}}
* {{spark.sql.optimizer.nestedPredicatePushdown.enabled=true}}

All four produce the identical 8-leaf {{readSchema}}, which is consistent with 
this JIRA's "the rule itself is what's limited" framing.

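A sweep of this kind can be sketched as a helper that sets each key to {{true}}, evaluates an observation, and then restores the previous value. {{ConfigSweep}} is a hypothetical name and not taken from the gist. The query must be rebuilt inside the {{observe}} block so that it is planned under the new setting.

{code:scala}
import org.apache.spark.sql.SparkSession

object ConfigSweep {
  // Evaluates `observe` once per key with that key set to "true",
  // restoring the prior value (or unsetting the key) afterwards.
  def run[T](spark: SparkSession, keys: Seq[String])(observe: => T): Seq[(String, T)] =
    keys.map { key =>
      val previous = spark.conf.getOption(key)
      spark.conf.set(key, "true")
      try key -> observe
      finally previous match {
        case Some(value) => spark.conf.set(key, value)
        case None        => spark.conf.unset(key)
      }
    }
}

// Assumed usage, where buildFailingQuery and leafCount are caller-supplied:
// ConfigSweep.run(spark, keys) { leafCount(buildFailingQuery(spark)) }
{code}
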
h2. Relation to other recent work

* *SPARK-51831* / [PR 51046|https://github.com/apache/spark/pull/51046] (merged 
Sep 2025) closed the flat-column V2 pruning gap reported in 
[apache/iceberg#9268|https://github.com/apache/iceberg/issues/9268]. The PR 
author explicitly noted nested arrays-of-structs were not the focus and left 
them as a follow-up — SPARK-34956 is that follow-up.
* *SPARK-42879* (open, unassigned) covers a related multi-field projection case 
but for a different access shape.

h2. Reproducer as a regression-test artifact

The reproducer's exit code is non-zero when the limitation is present and zero 
otherwise, so it can serve as a regression test for a candidate fix. Happy to 
maintain it or move it to a more permanent location if that would help.

h2. Environment

* Spark 3.5.0 / 3.5.1 / 3.5.2 / 3.5.3 (all reproduce)
* Iceberg 1.9.2 ({{iceberg-spark-runtime-3.5_2.12}})
* Scala 2.12.19
* JDK 17 (Temurin)
* Local-mode single-node Spark; hadoop-catalog Iceberg over a temp directory
* AQE enabled (does not affect the result)

> Support multiple fields when nested column pruning on Generator output 
> -----------------------------------------------------------------------
>
>                 Key: SPARK-34956
>                 URL: https://issues.apache.org/jira/browse/SPARK-34956
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: L. C. Hsieh
>            Priority: Major
>
> SPARK-34638 enhances nested column pruning rule on Generator's output. 
> SPARK-34638 supports single field case, e.g. 
> {{df.select(explode($"items").as("item")).select($"item.a")}}. This ticket is 
> open for tracking multiple-field support.


