voonhous opened a new issue, #19035:
URL: https://github.com/apache/hudi/issues/19035
## Summary
Characterize the performance of Variant in Hudi so we can make data-backed
claims about the shredded vs unshredded tradeoff. This is the benchmarking
item referenced in the Variant epic roadmap (#17744). It is intentionally
deferred out of the initial 1.2 functional work; results will also back the
planned Variant blog post.
## Motivation
- We currently ship shredded and unshredded Variant (Spark 4.0+ write,
  Spark 4.1+ shredded read) without Hudi-specific performance numbers.
- Shredding trades write cost for read pruning, predicate pushdown, and data
  skipping. We need to quantify that tradeoff on Hudi tables (COW and MOR)
  rather than borrow numbers from other projects.
## Baselines to compare
Same logical data and same queries across:
1. JSON stored as a STRING column
2. Semi-structured data stored as a nested STRUCT
3. Variant, unshredded
4. Variant, shredded
Table types: COW and MOR. Engine: Spark 4.1+ (required for shredded read-back).
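
The comparison matrix implied above (four encodings, two table types) can be enumerated up front so every run carries a consistent label. This is only a minimal sketch; `BASELINES`, `TABLE_TYPES`, and `benchmark_matrix` are hypothetical names chosen for illustration and are not part of existing Hudi benchmark tooling:

```python
from itertools import product

# Illustrative labels mirroring the four baselines and two table types listed above.
BASELINES = ["json_string", "nested_struct", "variant_unshredded", "variant_shredded"]
TABLE_TYPES = ["COW", "MOR"]


def benchmark_matrix():
    """Return one (baseline, table_type) pair per benchmark configuration."""
    return list(product(BASELINES, TABLE_TYPES))


if __name__ == "__main__":
    for baseline, table_type in benchmark_matrix():
        print(f"{table_type:>3}  {baseline}")
```

Each pair would then be run against the same logical data and the same queries.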
## Metrics
Write path:
- write throughput / wall-clock for bulk insert and for upsert / MERGE
- write amplification: base file size, log file size (MOR), total bytes
- shredding overhead vs unshredded write
Read path:
- targeted single-field access latency (variant_get on a shredded field)
- full-row reconstruction latency
- bytes scanned and row groups pruned (predicate pushdown + data skipping
  on shredded fields)
- projection pruning effectiveness
Storage:
- on-disk size by encoding, with and without shredding
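
To keep the write-path numbers comparable across runs, the derived metrics could be computed from a small record of raw measurements. The sketch below is an assumption about how this might be done: the `WriteSample` fields, the definition of write amplification as bytes written per input byte, and the `shredding_overhead` helper are all hypothetical and not defined by the issue:

```python
from dataclasses import dataclass


@dataclass
class WriteSample:
    """Raw measurements for one write run (all field names are hypothetical)."""
    input_bytes: int       # logical size of the ingested batch
    base_file_bytes: int   # bytes written to base files
    log_file_bytes: int    # bytes written to log files (0 for COW)
    wall_clock_s: float    # end-to-end write time in seconds

    @property
    def total_bytes(self) -> int:
        return self.base_file_bytes + self.log_file_bytes

    @property
    def write_amplification(self) -> float:
        # Assumed definition: bytes written to storage per byte of input.
        return self.total_bytes / self.input_bytes


def shredding_overhead(shredded: WriteSample, unshredded: WriteSample) -> float:
    """Relative wall-clock cost of shredding; 0.25 means 25% slower."""
    return shredded.wall_clock_s / unshredded.wall_clock_s - 1.0
```

Whatever definitions the harness settles on, stating them once in code avoids mixing incompatible ratios in the results writeup.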
## Workload knobs
- field cardinality / number of top-level keys
- nesting depth
- shredding coverage (fraction of fields shredded vs left in the residual value)
- selectivity of predicates over shredded fields
- null density and type heterogeneity (exercise the typed_value fallback path)
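
Several of these knobs can be exercised with a seeded synthetic document generator. The following is a minimal sketch under stated assumptions: `make_doc` and its parameters are hypothetical, the key names (`k0`, `child`) are placeholders, and shredding coverage and predicate selectivity are left to the table and query configuration rather than modeled here:

```python
import json
import random


def make_doc(rng, num_keys=8, depth=2, null_density=0.1, hetero=0.2):
    """Build one synthetic document.

    num_keys     -- number of top-level keys
    depth        -- nesting depth (1 means flat)
    null_density -- probability that a leaf is null
    hetero       -- probability that a leaf is a string instead of an int
    """
    def leaf():
        if rng.random() < null_density:
            return None
        value = rng.randint(0, 1_000_000)
        # Mixed leaf types are the heterogeneity knob from the list above.
        return str(value) if rng.random() < hetero else value

    def node(level):
        if level >= depth:
            return leaf()
        return {"child": node(level + 1)}

    return {f"k{i}": node(1) for i in range(num_keys)}


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are reproducible
    print(json.dumps(make_doc(rng, num_keys=3, depth=2)))
```

The same JSON text can feed all four baselines, which keeps the logical data identical across encodings.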
## Deliverables
- A reproducible benchmark harness (prefer extending existing Hudi benchmark
  tooling over a new one).
- A short results writeup (tables / plots) suitable to feed the Variant blog
  post.
- Recommendations: default shredding guidance, when shredding hurts (e.g.
  write-heavy MOR), and config tuning notes.
## Out of scope (tracked elsewhere)
- read-then-reshred / compaction over shredded base files: #18931
- colstats for shredded variants: #17988
- auto shredding schema inference: #18038, #18937
## Notes
- Parent epic: #17744
- Extension of RFC-99.