[I] Benchmark Variant performance: shredded vs unshredded (write cost, read, data skipping, storage) [hudi]

via GitHub Wed, 17 Jun 2026 02:22:29 -0700


voonhous opened a new issue, #19035:
URL: https://github.com/apache/hudi/issues/19035


   ## Summary
   
   Characterize the performance of Variant in Hudi so we can make data-backed claims
   about the shredded vs unshredded tradeoff. This is the benchmarking item referenced
   in the Variant epic roadmap (#17744). It is intentionally deferred out of the initial
   1.2 functional work; results will also back the planned Variant blog post.
   
   ## Motivation
   
   - We currently ship shredded and unshredded Variant (Spark 4.0+ write, Spark 4.1+
     shredded read) without Hudi-specific performance numbers.
   - Shredding trades write cost for read pruning, predicate pushdown, and data
     skipping. We need to quantify that tradeoff on Hudi tables (COW and MOR) rather
     than borrow numbers from other projects.
   
   ## Baselines to compare
   
   Same logical data and same queries across:
   
   1. JSON stored as a STRING column
   2. Semi-structured data stored as a nested STRUCT
   3. Variant, unshredded
   4. Variant, shredded
   
   Table types: COW and MOR. Engine: Spark 4.1+ (required for shredded read-back).
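
As a rough illustration of the comparison matrix above, the sketch below crosses the four encodings with the two table types. The labels, the `BenchmarkCase` type, and the table naming scheme are all hypothetical and are not part of any existing Hudi benchmark tooling.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative labels mirroring the four baselines listed above.
ENCODINGS = ["json_string", "nested_struct", "variant_unshredded", "variant_shredded"]
TABLE_TYPES = ["COW", "MOR"]


@dataclass(frozen=True)
class BenchmarkCase:
    encoding: str
    table_type: str

    @property
    def table_name(self) -> str:
        # Hypothetical naming scheme so that each case writes to its own table.
        return f"bench_{self.encoding}_{self.table_type.lower()}"


def benchmark_matrix() -> list[BenchmarkCase]:
    """Cross every encoding with every table type."""
    return [BenchmarkCase(e, t) for e, t in product(ENCODINGS, TABLE_TYPES)]


if __name__ == "__main__":
    for case in benchmark_matrix():
        print(case.table_name)
```

Each case would then load the same logical data and run the same queries, so that differences can be attributed to the encoding and table type alone.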
   
   ## Metrics
   
   Write path:
   - write throughput / wall-clock for bulk insert and for upsert / MERGE
   - write amplification: base file size, log file size (MOR), total bytes
   - shredding overhead vs unshredded write
   
   Read path:
   - targeted single-field access latency (variant_get on a shredded field)
   - full-row reconstruction latency
   - bytes scanned and row groups pruned (predicate pushdown + data skipping on
     shredded fields)
   - projection pruning effectiveness
   
   Storage:
   - on-disk size by encoding, with and without shredding
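
One possible way to record these metrics per run and derive the comparison ratios is sketched below. The `RunMetrics` fields and the helper names are assumptions made for illustration; how each raw number is actually collected (Spark metrics, file listings, and so on) is left open.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Raw measurements for one benchmark run (field names are illustrative)."""
    write_seconds: float
    base_file_bytes: int
    log_file_bytes: int      # expected to be 0 for COW tables
    row_groups_total: int
    row_groups_read: int

    @property
    def total_bytes(self) -> int:
        # Total on-disk footprint: base files plus MOR log files.
        return self.base_file_bytes + self.log_file_bytes

    @property
    def pruned_fraction(self) -> float:
        # Fraction of row groups skipped by pushdown / data skipping.
        if self.row_groups_total == 0:
            return 0.0
        return 1.0 - self.row_groups_read / self.row_groups_total


def shredding_overhead(shredded: RunMetrics, unshredded: RunMetrics) -> float:
    """Write-time ratio; values above 1.0 mean shredding made the write slower."""
    return shredded.write_seconds / unshredded.write_seconds


def size_ratio(shredded: RunMetrics, unshredded: RunMetrics) -> float:
    """On-disk size ratio; values below 1.0 mean shredding reduced storage."""
    return shredded.total_bytes / unshredded.total_bytes
```

Expressing results as ratios against the unshredded run keeps the writeup comparable across cluster sizes and data volumes.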
   
   ## Workload knobs
   
   - field cardinality / number of top-level keys
   - nesting depth
   - shredding coverage (fraction of fields shredded vs left in the residual value)
   - selectivity of predicates over shredded fields
   - null density and type heterogeneity (exercise the typed_value fallback path)
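
A minimal sketch of a seeded synthetic document generator driven by some of these knobs follows. The `WorkloadKnobs` type, its defaults, and the key naming are hypothetical. Shredding coverage and predicate selectivity are not modeled here, on the assumption that they are controlled by the shredding schema and the queries rather than by the data generator.

```python
import json
import random
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class WorkloadKnobs:
    top_level_keys: int = 8      # number of top-level keys per document
    nesting_depth: int = 2       # 1 = flat, d = leaf wrapped in d-1 objects
    null_density: float = 0.1    # probability that a leaf is null
    heterogeneity: float = 0.05  # probability that a leaf has an off-type value
    seed: int = 42               # fixed seed so every baseline sees the same data


def _leaf(rng: random.Random, knobs: WorkloadKnobs):
    if rng.random() < knobs.null_density:
        return None
    if rng.random() < knobs.heterogeneity:
        # Off-type value: a string where an integer is normally expected.
        return str(rng.randint(0, 1000))
    return rng.randint(0, 1000)


def _value(rng: random.Random, knobs: WorkloadKnobs, depth: int):
    # Wrap the leaf in single-key objects until the requested depth is reached.
    if depth > 1:
        return {"n": _value(rng, knobs, depth - 1)}
    return _leaf(rng, knobs)


def generate_docs(knobs: WorkloadKnobs, count: int) -> Iterator[str]:
    """Yield `count` JSON documents, deterministic for a given seed."""
    rng = random.Random(knobs.seed)
    for _ in range(count):
        doc = {
            f"k{i}": _value(rng, knobs, knobs.nesting_depth)
            for i in range(knobs.top_level_keys)
        }
        yield json.dumps(doc)
```

The same JSON strings could then be loaded as a STRING column, parsed into a STRUCT, or converted to Variant, which keeps the logical data identical across baselines.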
   
   ## Deliverables
   
   - A reproducible benchmark harness (prefer extending existing Hudi benchmark
     tooling over a new one).
   - A short results writeup (tables / plots) suitable to feed the Variant blog post.
   - Recommendations: default shredding guidance, when shredding hurts (e.g.
     write-heavy MOR), and config tuning notes.
   
   ## Out of scope (tracked elsewhere)
   
   - read-then-reshred / compaction over shredded base files: #18931
   - colstats for shredded variants: #17988
   - auto shredding schema inference: #18038, #18937
   
   ## Notes
   
   - Parent epic: #17744
   - Extension of RFC-99.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
