PavithranRick opened a new pull request, #17693:
URL: https://github.com/apache/hudi/pull/17693
### Describe the issue this Pull Request addresses
This PR adds large-scale testing coverage for the metadata table, focusing
on performance and correctness when tables are partitioned by `datestr`. The
motivation is to validate lookup behavior and scalability when using metadata
partitions—specifically FILES and Column Stats—under realistic data
distributions and query patterns.
### Summary and Changelog
**Summary**
Introduce a large-scale metadata table test framework that:
- Generates controlled data distributions across file groups
- Populates column statistics deterministically
- Benchmarks lookup performance using the Column Stats partition over date
ranges
**Changelog / TODO**
- Added large-scale metadata table tests with `datestr` partitioning
- Enabled and validated FILES metadata partition at scale
- Enabled and validated Column Stats metadata partition at scale
- Implemented configurable data generation to control:
  - Column-to-file-group spread
  - Record distribution per partition
  - Column statistics characteristics (e.g., min/max, value skew)
- Added lookup benchmarks for column-value predicates combined with date
range filters
- Collected and reported lookup latency metrics for Column Stats–based
pruning
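
As a rough illustration of the data-generation and lookup items above, the following is a minimal, self-contained Python sketch and not the test code from this PR. `FileStats`, `generate_stats`, `prune`, the `spread` parameter, and the `yyyy-MM-dd` form of `datestr` are all hypothetical names and assumptions chosen for the example; they do not reflect Hudi's actual column stats schema or APIs.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for one column-stats entry (not Hudi's schema)."""
    datestr: str
    file_id: str
    min_value: int
    max_value: int


def generate_stats(start, num_days, files_per_partition, spread, seed=42):
    """Build per-file min/max ranges for each datestr partition.

    A fixed seed makes the output deterministic. `spread` is the width of
    each file's value range: wider ranges overlap more, so fewer files can
    be pruned for a given lookup value.
    """
    rng = random.Random(seed)
    stats = []
    for day in range(num_days):
        datestr = (start + timedelta(days=day)).isoformat()
        for f in range(files_per_partition):
            low = rng.randrange(0, 1000)
            stats.append(FileStats(datestr, f"fg-{day}-{f}", low, low + spread))
    return stats


def prune(stats, value, date_from, date_to):
    """Return ids of files that may contain `value` within the date range.

    ISO-formatted date strings compare correctly as plain strings.
    """
    return [
        s.file_id
        for s in stats
        if date_from <= s.datestr <= date_to
        and s.min_value <= value <= s.max_value
    ]


if __name__ == "__main__":
    stats = generate_stats(date(2024, 1, 1), num_days=30,
                           files_per_partition=8, spread=50)
    candidates = prune(stats, value=500,
                       date_from="2024-01-10", date_to="2024-01-20")
    print(f"{len(candidates)} of {len(stats)} files remain after pruning")
```

Raising `spread` in this sketch widens every file's min/max range, which is one simple way to model poor column-to-file-group locality and observe how pruning effectiveness degrades.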
### Impact
- Improves confidence in metadata table scalability and performance
characteristics
- Provides measurable performance insights for Column Stats–based lookups
over large date ranges
- No user-facing API changes
- Test-only impact; no change to production behavior
### Risk Level
**low**
Changes are limited to test infrastructure and benchmarking.
Verification includes:
- Running tests at large scale with multiple partitions and file groups
- Validating correctness of lookup results against expected matches
- Comparing lookup performance with and without Column Stats partition
enabled
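
The verification steps above can be sketched in miniature as follows. This is an illustrative Python model under stated assumptions, not the benchmark harness from this PR: `build_files`, `lookup_full_scan`, `lookup_with_stats`, and `timed` are hypothetical helpers, and the in-memory lists stand in for file groups and their min/max statistics.

```python
import random
import time


def build_files(num_files, records_per_file, seed=0):
    """Create hypothetical files, each with values and min/max stats."""
    rng = random.Random(seed)
    files = []
    for _ in range(num_files):
        base = rng.randrange(0, 100_000)
        values = [base + rng.randrange(0, 100) for _ in range(records_per_file)]
        files.append({"values": values, "min": min(values), "max": max(values)})
    return files


def lookup_full_scan(files, target):
    """Baseline: inspect every file (models stats-based pruning disabled)."""
    return sorted(i for i, f in enumerate(files) if target in f["values"])


def lookup_with_stats(files, target):
    """Skip files whose [min, max] range cannot contain the target."""
    candidates = [i for i, f in enumerate(files)
                  if f["min"] <= target <= f["max"]]
    return sorted(i for i in candidates if target in files[i]["values"])


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    files = build_files(num_files=5_000, records_per_file=200)
    target = files[0]["values"][0]  # a value known to exist

    expected, scan_time = timed(lookup_full_scan, files, target)
    actual, stats_time = timed(lookup_with_stats, files, target)

    # Correctness first: pruning must never drop a matching file.
    assert actual == expected
    print(f"full scan: {scan_time:.4f}s, with stats: {stats_time:.4f}s")
```

The correctness assertion runs before any timing is reported, so a latency figure is only meaningful when both lookup paths return the same set of files.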
### Documentation Update
none
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable