Davis-Zhang-Onehouse opened a new pull request, #12803: URL: https://github.com/apache/hudi/pull/12803
### Change Logs

This PR optimizes the record size estimation logic by improving how we process commit metadata from the active timeline. The new implementation leverages parallel processing and smart filtering to compute average record sizes more efficiently.

The key improvements include:

- Filtering commits before metadata parsing to focus only on relevant actions
- Processing multiple commits in parallel using Java streams
- Using atomic accumulators for thread-safe statistics aggregation
- More resilient error handling that continues processing despite individual commit failures

The core logic was refactored into a dedicated AverageRecordSizeStats class to improve code organization and maintainability.

### Impact

While there are no changes to the public API, this optimization provides significant performance benefits when computing record size estimates, particularly for timelines with many commits. The change:

- Reduces processing time by parallelizing commit metadata parsing
- Provides more accurate size estimates by considering multiple commits instead of stopping at the first valid one
- Improves reliability through better error handling and atomic operations
- Maintains backward compatibility with existing behavior when no valid commits are found

### Risk level (write none, low medium or high below)

Low

The changes are focused on internal implementation details and maintain the same functional behavior. Extensive unit tests have been added to verify:

- Parallel processing correctness
- Error handling scenarios
- Boundary conditions (empty timeline, invalid commits)
- Compatibility with existing configuration parameters

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
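
For readers who want a concrete picture of the approach described in the Change Logs, here is a minimal sketch of the four listed steps: filter by action first, parse in parallel, aggregate with atomic counters, and skip commits that fail to parse. It is not the code in this PR. The class `AverageRecordSizeSketch`, the `CommitInfo` type, the `parse` helper, and the `fallback` parameter are all hypothetical stand-ins, and the real `AverageRecordSizeStats` class may be structured quite differently.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; not Hudi's AverageRecordSizeStats. */
final class AverageRecordSizeSketch {

  /** Hypothetical stand-in for one commit on the timeline. */
  static final class CommitInfo {
    final String action;
    final long bytesWritten;
    final long recordsWritten;
    final boolean corrupt; // simulates metadata that cannot be parsed

    CommitInfo(String action, long bytesWritten, long recordsWritten, boolean corrupt) {
      this.action = action;
      this.bytesWritten = bytesWritten;
      this.recordsWritten = recordsWritten;
      this.corrupt = corrupt;
    }
  }

  /** Hypothetical "expensive" metadata parse; throws on unreadable metadata. */
  static long[] parse(CommitInfo commit) {
    if (commit.corrupt) {
      throw new IllegalStateException("unreadable commit metadata");
    }
    return new long[] {commit.bytesWritten, commit.recordsWritten};
  }

  /**
   * Returns total bytes divided by total records across all parseable,
   * relevant commits, or the fallback when no records were counted.
   */
  static long estimate(List<CommitInfo> commits, Set<String> relevantActions, long fallback) {
    AtomicLong totalBytes = new AtomicLong();
    AtomicLong totalRecords = new AtomicLong();

    commits.parallelStream()
        // Cheap filter on the action runs before any metadata is parsed.
        .filter(c -> relevantActions.contains(c.action))
        .forEach(c -> {
          try {
            long[] stats = parse(c);
            // Atomic adds keep the totals correct across worker threads.
            totalBytes.addAndGet(stats[0]);
            totalRecords.addAndGet(stats[1]);
          } catch (RuntimeException e) {
            // One bad commit is skipped; the remaining commits still count.
          }
        });

    long records = totalRecords.get();
    return records > 0 ? totalBytes.get() / records : fallback;
  }
}
```

In this sketch, `parse` throws before either counter is touched, so a failed commit contributes nothing to the totals. The counters are read only after the terminal `forEach` has returned, at which point all parallel work has completed.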
