Davis-Zhang-Onehouse opened a new pull request, #12803:
URL: https://github.com/apache/hudi/pull/12803

   ### Change Logs
   
   This PR optimizes record size estimation by changing how commit metadata 
from the active timeline is processed. The new implementation filters commits 
before parsing their metadata and parses the remaining commits in parallel, so 
average record sizes are computed more efficiently. The key improvements include:
   
   - Filtering commits before metadata parsing to focus only on relevant actions
   - Processing multiple commits in parallel using Java streams
   - Using atomic accumulators for thread-safe statistics aggregation
   - More resilient error handling that continues processing despite individual 
commit failures
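
The four points above could be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `CommitInfo`, `SizeAccumulator`, the `"bytes:records"` metadata format, and the set of relevant actions are all hypothetical stand-ins for Hudi's timeline instants and commit metadata.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: none of these types are Hudi classes.
final class ParallelRecordSizeSketch {

  /** Stand-in for a timeline instant: an action name plus raw metadata "bytes:records". */
  static final class CommitInfo {
    final String action;
    final String rawMetadata;

    CommitInfo(String action, String rawMetadata) {
      this.action = action;
      this.rawMetadata = rawMetadata;
    }
  }

  /** Assumed set of actions that carry usable write statistics. */
  static final Set<String> RELEVANT_ACTIONS =
      new HashSet<>(Arrays.asList("commit", "deltacommit"));

  /** Thread-safe totals shared by all parallel workers. */
  static final class SizeAccumulator {
    final AtomicLong totalBytes = new AtomicLong();
    final AtomicLong totalRecords = new AtomicLong();
    final AtomicLong failedCommits = new AtomicLong();
  }

  static SizeAccumulator aggregate(List<CommitInfo> commits) {
    SizeAccumulator acc = new SizeAccumulator();
    commits.parallelStream()
        // Filter on the cheap action name before any metadata is parsed.
        .filter(c -> RELEVANT_ACTIONS.contains(c.action))
        .forEach(c -> {
          try {
            // Parse both values first so a bad commit never updates only one total.
            String[] parts = c.rawMetadata.split(":");
            long bytes = Long.parseLong(parts[0]);
            long records = Long.parseLong(parts[1]);
            acc.totalBytes.addAndGet(bytes);
            acc.totalRecords.addAndGet(records);
          } catch (RuntimeException e) {
            // A single unreadable commit is counted and skipped, not fatal.
            acc.failedCommits.incrementAndGet();
          }
        });
    return acc;
  }

  public static void main(String[] args) {
    List<CommitInfo> commits = Arrays.asList(
        new CommitInfo("commit", "1000:10"),
        new CommitInfo("clean", "not-parsed"),      // filtered out before parsing
        new CommitInfo("deltacommit", "corrupt"),   // parse failure, skipped
        new CommitInfo("commit", "500:5"));
    SizeAccumulator acc = aggregate(commits);
    System.out.println(acc.totalBytes.get() + " bytes, "
        + acc.totalRecords.get() + " records, "
        + acc.failedCommits.get() + " failed");
  }
}
```

In this sketch the two totals are updated independently, so they are only guaranteed to be mutually consistent once the parallel stream has finished, which is when the average is read.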
   
   The core logic was refactored into a dedicated AverageRecordSizeStats class 
to improve code organization and maintainability.
   
   ### Impact
   
   There are no changes to the public API. The optimization speeds up the 
computation of record size estimates, particularly for timelines with many 
commits. The change:
   
   - Reduces processing time by parallelizing commit metadata parsing
   - Provides more accurate size estimates by considering multiple commits 
instead of stopping at the first valid one
   - Improves reliability by continuing past individual commit failures and aggregating statistics atomically
   - Maintains backward compatibility with existing behavior when no valid 
commits are found
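
To make the accuracy and fallback points concrete, here is a small sketch of how an average might be derived from several commits at once. It assumes each commit contributes a pair of (bytes written, records written) and that a caller-supplied default is returned when nothing usable is found. The class, method, and parameter names are hypothetical and not taken from this PR.

```java
// Hypothetical sketch: illustrates aggregate averaging with a fallback value.
final class AverageSizeSketch {

  /**
   * @param commitStats     one row per commit: {bytesWritten, recordsWritten}
   * @param defaultEstimate value returned when no commit has usable statistics
   */
  static long averageRecordSize(long[][] commitStats, long defaultEstimate) {
    long totalBytes = 0L;
    long totalRecords = 0L;
    for (long[] stats : commitStats) {
      long bytes = stats[0];
      long records = stats[1];
      // Skip commits without usable statistics instead of failing.
      if (bytes > 0 && records > 0) {
        totalBytes += bytes;
        totalRecords += records;
      }
    }
    if (totalRecords == 0) {
      // No valid commits: keep the previous behavior of using the default.
      return defaultEstimate;
    }
    return (long) Math.ceil((double) totalBytes / totalRecords);
  }

  public static void main(String[] args) {
    long[][] stats = { {1000L, 10L}, {9000L, 30L} };
    System.out.println(averageRecordSize(stats, 1024L));
    System.out.println(averageRecordSize(new long[0][], 1024L));
  }
}
```

With the two sample commits, stopping at the first valid commit would give 1000 / 10 = 100 bytes per record, while aggregating both gives 10000 / 40 = 250, which is the sense in which using multiple commits changes the estimate.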
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   The changes are focused on internal implementation details and maintain the 
same functional behavior. Extensive unit tests have been added to verify:
   
   - Parallel processing correctness
   - Error handling scenarios
   - Boundary conditions (empty timeline, invalid commits)
   - Compatibility with existing configuration parameters
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

