danny0405 opened a new pull request, #19022:
URL: https://github.com/apache/hudi/pull/19022

   ### Describe the issue this Pull Request addresses
   
   Flink `WriteProfile` and `DeltaWriteProfile` refresh average record size 
from commit or delta-commit metadata. When the current timeline has no eligible 
commit metadata for size estimation, the refresh path could fall back to the 
configured default record size estimate, even if the profile had already 
learned a better value from a preceding eligible commit.
   
   This can make `recordsPerBucket` jump back to the default estimate after a 
reload, affecting insert bucket sizing for Flink COW and MOR writes. This PR 
keeps the preceding profiled average size when a later refresh cannot derive a 
valid estimate from eligible commits or delta commits.
   
   ### Summary and Changelog
   
   Flink write profiles now preserve the last successful average record size 
estimate across reloads when no eligible commit metadata is available for the 
current estimation pass.
   
   - Updated `WriteProfile.averageBytesPerRecord()` to seed estimation from the 
existing `avgSize` once initialized, falling back to 
`COPY_ON_WRITE_RECORD_SIZE_ESTIMATE` only before any profiled value exists.
   - Updated `DeltaWriteProfile.averageBytesPerRecord()` with the same reuse 
behavior for MOR commit and delta-commit estimation.
   - Moved average-size logging into `WriteProfile.recordProfile()` so the 
refreshed value is logged once after the profile is updated.
   - Added regression coverage in `TestBucketAssigner` for:
     - `testWriteProfileReusesPreviousAvgSizeWhenNoEligibleCommitOnReload`
     - 
`testDeltaWriteProfileReusesPreviousAvgSizeWhenNoEligibleDeltaCommitOnReload`
   
   ### Impact
   
   This affects Flink write bucket sizing for COW and MOR tables when profile 
reloads encounter commits or delta commits that are not eligible for 
record-size estimation. The change avoids reverting to the configured default 
estimate after a better preceding estimate has already been learned.
   
   There are no public API, config, storage format, or compatibility changes. 
Performance impact is negligible; the change only reuses an already cached 
in-memory value during profile refresh.
   
   ### Risk Level
   
   low
   
   The change is limited to Flink `WriteProfile` and `DeltaWriteProfile` 
average record size estimation. The main behavioral risk is retaining a stale 
estimate longer when no eligible metadata exists, but that is the intended 
behavior and is safer than reverting to a less accurate configured default. 
Once eligible commit metadata is available again, the estimate is refreshed 
from metadata as before.
   
   Verified with:
   
   `mvn -pl hudi-flink-datasource/hudi-flink -am -Dtest=TestBucketAssigner 
-Dsurefire.failIfNoSpecifiedTests=false test -DskipITs`
   
   Result: `TestBucketAssigner` ran 16 tests with 0 failures.
   
   ### Documentation Update
   
   none
   
   This is an internal Flink write profile estimation fix with no new 
user-facing configuration, API, or documented behavior change.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to