qlong opened a new pull request, #16714: URL: https://github.com/apache/iceberg/pull/16714

**Changes**

This PR is part of the work to support variant extraction pushdown. The core change introduces new Parquet readers that read only the selected variant paths instead of the whole variant:

- Add selective Parquet readers (ParquetVariantExtractionReaders, VariantExtractionPathResolver) to read only the shredded typed_value columns for the requested extraction paths.
- Add a Spark row reader adapter (SparkVariantExtractionReaders, SparkParquetReaders) to materialize extraction slots from the engine read schema instead of full variant blobs.
- Wire the engine read schema from SparkBatch through SparkInputPartition to RowDataReader only (row Parquet path).
- Update PruneColumnsWithoutReordering so that annotated extraction structs map back to Iceberg VARIANT columns in the scan projection.

Issue: https://github.com/apache/iceberg/issues/16448

**Note for reviewer**

- PathUtil.java is mostly copied from existing PR #15384; I will rebase once that PR is merged.
- To reduce scope, the new selective readers are wired in only for the batch row scan. Wiring them into other readers can be done as a follow-up.
- To reduce scope, extraction supports only the most commonly used data types. Arrays and structs (including nested structs) are not supported. Requesting shredded columns of unsupported types falls back to reading the whole variant (extraction pushdown is rejected).

**Test**

The test uses one day of GitHub activity data, ingested as JSON and stored as shredded variants with 299 shredded columns.

- Baseline: `gha-payload-iceberg-20260605` · variant + **extraction pushdown ON** + **selective shredded variant extraction Parquet readers**
- Compare A: same run with payload stored as `string_json`
- Compare B: `gha-payload-iceberg-nopushed-20260605` · variant + pushdown OFF, read whole variant

Median of 3 timed runs per query (Spark `Time taken:`).

| Query | Variant + pushdown (s) | string_json (s) | Δ vs baseline | Variant no-pushdown (s) | Δ vs baseline |
|-------|------------------------:|----------------:|--------------:|------------------------:|--------------:|
| c-q01 | 2.605 | 2.373 | −8.9% | 2.945 | +13.0% |
| c-q04 | 3.875 | 6.412 | +65.5% | 72.082 | +1760% |
| c-q05b | 3.506 | 5.683 | +62.1% | 39.154 | +1017% |
| c-q06 | 4.668 | 6.583 | +41.0% | 76.935 | +1548% |
| c-q07 | 4.714 | 4.490 | −4.8% | 75.033 | +1492% |
| c-q08 | 3.701 | 4.707 | +27.2% | 87.059 | +2252% |
| c-q09 | 5.102 | 6.668 | +30.7% | 72.560 | +1322% |
| c-q10 | 4.395 | 6.568 | +49.4% | 68.179 | +1451% |
| c-q11 | 4.495 | 6.384 | +42.0% | 67.985 | +1413% |
| c-q12 | 3.995 | 4.284 | +7.2% | 70.509 | +1665% |
| c-q13 | 3.911 | 4.060 | +3.8% | 39.614 | +913% |
| c-q14 | 4.769 | 5.450 | +14.3% | 63.331 | +1228% |
| **Total (Σ)** | **49.74** | **63.66** | **+28.0%** | **735.39** | **+1379%** |
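
For readers checking the Δ columns: they appear to be the relative change of each comparison run against the baseline, `(other - baseline) / baseline`, so a positive value means the comparison run was slower than variant + pushdown. The sketch below is not part of the PR. The function name `delta_vs_baseline` is made up for illustration, and the formula is an assumption inferred from the table. It reproduces the reported figures for the rows shown to within rounding.

```python
def delta_vs_baseline(baseline_s: float, other_s: float) -> float:
    """Relative change of `other_s` against `baseline_s`, in percent.

    Assumed convention (inferred from the table, not stated in the PR):
    positive means the comparison run took longer than the baseline.
    """
    return (other_s - baseline_s) / baseline_s * 100.0


# (baseline, string_json, no-pushdown) medians in seconds, copied from the table.
rows = {
    "c-q04": (3.875, 6.412, 72.082),
    "c-q07": (4.714, 4.490, 75.033),
}

for query, (baseline, string_json, no_pushdown) in rows.items():
    print(
        f"{query}: string_json {delta_vs_baseline(baseline, string_json):+.1f}%, "
        f"no-pushdown {delta_vs_baseline(baseline, no_pushdown):+.0f}%"
    )
```

Read this way, the total row says the `string_json` run took about 28% longer than the baseline overall, and the no-pushdown variant run took roughly 14.8 times as long (735.39 s against 49.74 s).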
