wombatu-kun opened a new pull request, #16723: URL: https://github.com/apache/iceberg/pull/16723
The row-group filters (`ParquetMetricsRowGroupFilter`, `ParquetDictionaryRowGroupFilter`, `ParquetBloomRowGroupFilter`) rebuild their per-column maps (`stats` / `valueCounts` / `conversions`, etc.) for every row group. Each map is created with `Maps.newHashMap()` and then filled with one entry per column. On wide schemas the map's backing table (its `Node[]` bucket array) is reallocated several times per row group as it grows past the load-factor threshold. Sizing the maps to the row group's column count up front goes straight to the right capacity and avoids those intermediate table reallocations.

The lazy caches (`dictCache`, `bloomCache`) and the conditional `fieldsWithBloomFilter` set are left at the default size since they are not filled to the column count.

This runs on the scan-planning path for every Parquet file in every engine (one `shouldRead` per row group) and is behavior-preserving.

Benchmarked the always-on `ParquetMetricsRowGroupFilter.shouldRead` with JMH and the gc profiler on a 64-column schema (one op = one row-group filter call):

| Metric | Before | After | Delta |
|---|---|---|---|
| Allocation | 17,280 B/op | 14,448 B/op | -2,832 B (-16%) |
| Time | 8.1 us/op | 6.3 us/op | faster (overlapping CIs) |

`shouldRead` is a thin wrapper that does `new MetricsEvalVisitor().eval(...)`, so the maps are allocated inside `eval`. The benchmark drives the public `shouldRead` entry point, which fully contains `eval` (and `eval` is private, so `shouldRead` is the only entry point). The visitor allocation is identical on both runs and the only change between them is the map sizing, so the measured delta is attributable entirely to the pre-sizing.

The dictionary and bloom filters get the identical pre-sizing for the same structural reason: their `shouldRead` is likewise a thin wrapper over an `eval` that fills maps to the column count per row group.
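To make the reallocation argument concrete, here is a minimal sketch that models how many bucket arrays a `java.util.HashMap` allocates while it is filled to a given size, using the JDK's documented defaults (initial capacity 16, load factor 0.75, capacity doubling). The class and method names (`PresizeSketch`, `tableAllocations`, `capacityFor`, `presizedMap`) are hypothetical and are not taken from the PR; the message above does not show which sizing helper the change actually calls, so this only illustrates the general technique.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not code from the PR.
final class PresizeSketch {
  private static final double LOAD_FACTOR = 0.75; // java.util.HashMap default

  // Smallest power of two >= n; HashMap rounds a requested capacity up this way.
  static int tableSizeFor(int n) {
    int size = 1;
    while (size < n) {
      size <<= 1;
    }
    return size;
  }

  // Initial capacity that lets `expectedSize` entries fit without a resize.
  static int capacityFor(int expectedSize) {
    return (int) Math.ceil(expectedSize / LOAD_FACTOR);
  }

  // Models how many bucket arrays are allocated while inserting `entries`
  // distinct keys into a HashMap created with `initialCapacity`.
  static int tableAllocations(int entries, int initialCapacity) {
    if (entries == 0) {
      return 0; // the table is allocated lazily on the first put
    }
    int capacity = tableSizeFor(initialCapacity);
    int allocations = 1; // first put allocates the initial table
    for (int size = 1; size <= entries; size++) {
      if (size > capacity * LOAD_FACTOR) {
        capacity <<= 1; // resize: a new, doubled table is allocated
        allocations++;
      }
    }
    return allocations;
  }

  // Builds a map sized for one entry per column.
  static <K, V> Map<K, V> presizedMap(int columnCount) {
    return new HashMap<>(capacityFor(columnCount));
  }

  public static void main(String[] args) {
    int columns = 64;
    System.out.println("default:  " + tableAllocations(columns, 16));
    System.out.println("presized: " + tableAllocations(columns, capacityFor(columns)));
  }
}
```

Under this model, a default-sized map filled with 64 entries allocates four tables (16, 32, 64, and 128 buckets), while a pre-sized one allocates only the final 128-bucket table. That is consistent with the direction of the allocation delta reported above, though the model counts tables only and does not attempt to reproduce the measured byte figures.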
Correctness is covered by the existing `TestMetricsRowGroupFilter`, `TestMetricsRowGroupFilterTypes`, `TestDictionaryRowGroupFilter`, and `TestBloomRowGroupFilter` suites.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
