qlong opened a new pull request, #16714:
URL: https://github.com/apache/iceberg/pull/16714

   **Changes**
   
   This PR is part of the work to support variant extraction pushdown. The core 
change is to introduce new Parquet readers that read only the selected variant 
paths instead of the whole variant:
   
   - Add selective Parquet readers (ParquetVariantExtractionReaders, 
VariantExtractionPathResolver) to read only shredded typed_value columns for 
requested extraction paths.
   - Add Spark row reader adapter (SparkVariantExtractionReaders, 
SparkParquetReaders) to materialize extraction slots from the engine read 
schema instead of full variant blobs.
   - Wire engine read schema from SparkBatch through SparkInputPartition to 
RowDataReader only (row Parquet path).
   - Update PruneColumnsWithoutReordering so annotated extraction structs map 
back to Iceberg VARIANT columns in the scan projection.
   
   Issue: https://github.com/apache/iceberg/issues/16448
   
   **Note for reviewer**
   - PathUtil.java is mostly copied from the existing PR #15384; I will rebase 
once that PR is merged.
   - To reduce the scope, the new selective readers are wired in only for the 
batch row scan. We can wire them into other readers as a follow-up.
   - To reduce the scope, only the most commonly used data types can be 
extracted. Extracting arrays and structs (including nested structs) is not 
supported. Requesting shredded columns of unsupported types causes the whole 
variant to be read (extraction pushdown is rejected).
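
   The fallback rule in the last note can be sketched as follows. This is a 
minimal Python illustration, not the Java code in this PR: `SUPPORTED_TYPES`, 
`shredded_column`, and `plan_extraction` are hypothetical names, the set of 
supported types is an assumption (the PR does not list them), and the column 
naming assumes the nested `typed_value` layout used for shredded object fields.

```python
from typing import List, Optional, Tuple

# Assumption: a hypothetical set of extractable primitive types.
# The PR only says arrays and structs are NOT supported.
SUPPORTED_TYPES = {"boolean", "int", "long", "double", "string"}


def shredded_column(variant_column: str, path: str) -> str:
    """Map an object path such as '$.actor.login' to the name of the
    shredded typed_value column that would hold it (assumed layout)."""
    fields = path[2:].split(".") if path.startswith("$.") else path.split(".")
    parts = [variant_column]
    for field in fields:
        parts += ["typed_value", field]
    parts.append("typed_value")
    return ".".join(parts)


def plan_extraction(
    variant_column: str, requests: List[Tuple[str, str]]
) -> Optional[List[str]]:
    """Return the shredded columns to read, or None when pushdown is
    rejected and the whole variant has to be read instead."""
    columns = []
    for path, requested_type in requests:
        if requested_type not in SUPPORTED_TYPES:
            # One unsupported request rejects pushdown for the variant.
            return None
        columns.append(shredded_column(variant_column, path))
    return columns
```

   With these assumptions, a request for `$.actor.login` as `string` yields a 
single shredded column, while adding any array or struct request makes 
`plan_extraction` return `None`, which stands for reading the whole variant.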
   
   **Test**
   
   The benchmark uses one day of GitHub activity data, ingested as JSON and 
stored as shredded variants with 299 shredded columns.
   
   Baseline: `gha-payload-iceberg-20260605` · variant + **extraction pushdown 
ON** + **selective shredded variant extraction Parquet readers**
   Compare A: same run with payload stored as `string_json`
   Compare B: `gha-payload-iceberg-nopushed-20260605` · variant + pushdown OFF, 
read whole variant
   Median of 3 timed runs per query (Spark `Time taken:`). Δ is relative to the 
baseline (variant + pushdown); a positive value means slower than the baseline.

   | Query | Variant + pushdown (s) | string_json (s) | Δ vs baseline | Variant no-pushdown (s) | Δ vs baseline |
   |-------|------------------------:|----------------:|--------------:|------------------------:|--------------:|
   | c-q01 | 2.605 | 2.373 | −8.9% | 2.945 | +13.0% |
   | c-q04 | 3.875 | 6.412 | +65.5% | 72.082 | +1760% |
   | c-q05b | 3.506 | 5.683 | +62.1% | 39.154 | +1017% |
   | c-q06 | 4.668 | 6.583 | +41.0% | 76.935 | +1548% |
   | c-q07 | 4.714 | 4.490 | −4.8% | 75.033 | +1492% |
   | c-q08 | 3.701 | 4.707 | +27.2% | 87.059 | +2252% |
   | c-q09 | 5.102 | 6.668 | +30.7% | 72.560 | +1322% |
   | c-q10 | 4.395 | 6.568 | +49.4% | 68.179 | +1451% |
   | c-q11 | 4.495 | 6.384 | +42.0% | 67.985 | +1413% |
   | c-q12 | 3.995 | 4.284 | +7.2% | 70.509 | +1665% |
   | c-q13 | 3.911 | 4.060 | +3.8% | 39.614 | +913% |
   | c-q14 | 4.769 | 5.450 | +14.3% | 63.331 | +1228% |
   | **Total (Σ)** | **49.74** | **63.66** | **+28.0%** | **735.39** | **+1379%** |
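
   The Δ columns appear to be the relative change against the baseline, 
`(other - baseline) / baseline`. The short Python check below recomputes the 
totals from the per-query medians in the table; the function and variable 
names are illustrative only.

```python
# Per-query medians in seconds, copied from the table:
# (variant + pushdown, string_json, variant no-pushdown)
TIMINGS = {
    "c-q01": (2.605, 2.373, 2.945),
    "c-q04": (3.875, 6.412, 72.082),
    "c-q05b": (3.506, 5.683, 39.154),
    "c-q06": (4.668, 6.583, 76.935),
    "c-q07": (4.714, 4.490, 75.033),
    "c-q08": (3.701, 4.707, 87.059),
    "c-q09": (5.102, 6.668, 72.560),
    "c-q10": (4.395, 6.568, 68.179),
    "c-q11": (4.495, 6.384, 67.985),
    "c-q12": (3.995, 4.284, 70.509),
    "c-q13": (3.911, 4.060, 39.614),
    "c-q14": (4.769, 5.450, 63.331),
}


def delta_pct(baseline_s: float, other_s: float) -> float:
    """Relative change of other_s against baseline_s, in percent."""
    return (other_s - baseline_s) / baseline_s * 100.0


# Sum each column across all queries.
total_pushdown = sum(row[0] for row in TIMINGS.values())
total_string_json = sum(row[1] for row in TIMINGS.values())
total_no_pushdown = sum(row[2] for row in TIMINGS.values())

string_json_delta = delta_pct(total_pushdown, total_string_json)
no_pushdown_delta = delta_pct(total_pushdown, total_no_pushdown)

print(f"string_json vs baseline: {string_json_delta:+.1f}%")
print(f"no-pushdown vs baseline: {no_pushdown_delta:+.0f}%")
```

   The totals are summed from the unrounded per-query values, so the result can 
differ slightly from one computed from the rounded totals shown in the last row.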
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

