shangxinli opened a new pull request, #18778:
URL: https://github.com/apache/hudi/pull/18778

   ### Describe the issue this Pull Request addresses
   
   Part of the freshness-tracking work discussed in #17512. This PR implements 
**Phase 1** of the reconcile plan: expose the per-partition event-time rollup 
that is already latent on disk, and stop gating watermark tracking on 
`EVENT_TIME_ORDERING` so freshness observability works for COW / 
`COMMIT_TIME_ORDERING` tables too.
   
   This change is purely additive: it adds no commit-metadata key, makes no 
Avro schema change, and does not change behavior for tables that have not 
opted into `hoodie.write.track.event.time.watermark`.
   
   ### Summary and Changelog
   
   Today `WriteStatus.markSuccess()` already folds min/max event time into each 
`HoodieWriteStat`, and the Avro schema already serializes them per stat 
alongside `partitionPath`. But the only public accessor on 
`HoodieCommitMetadata` is `getMinAndMaxEventTime()`, which collapses every 
partition into a single pair, so consumers asking *"how fresh is partition 
dt=2026-05-19?"* have to walk `partitionToWriteStats` themselves.
   
   Watermark tracking is also currently gated on `recordMergeMode == 
EVENT_TIME_ORDERING`, even though freshness observability is independent of 
merge semantics. The result is that COW tables with `COMMIT_TIME_ORDERING` 
silently get no watermark even when the user explicitly opts in.
   
   This PR:
   
   - **Adds `HoodieCommitMetadata.getMinAndMaxEventTimePerPartition()`**: a 
pure aggregation over `partitionToWriteStats` that returns `Map<String, 
Pair<Option<Long>, Option<Long>>>`. Partitions whose stats carry no event time 
at all are omitted, so the map size reflects the number of partitions with 
freshness data, not the total number of partitions written. Min/max within a 
partition are folded with `Math.min` / `Math.max`, mirroring the semantics of 
the existing global getter. Nothing new is persisted and the Avro schema is 
unchanged.
   - **Decouples watermark tracking from `EVENT_TIME_ORDERING`** in 
`HoodieWriteHandle`. Tracking now activates when `eventTimeFieldName != null && 
hoodie.write.track.event.time.watermark=true`, regardless of merge mode. The 
unused `EVENT_TIME_ORDERING` static import is removed.
   - **Tests:** five new unit tests for the rollup API in 
`TestHoodieCommitMetadata` (folding across stats within a partition, omitting 
partitions without event time, handling partial min/max, empty metadata, and a 
consistency check against the global getter); updates the existing 
`testShouldTrackEventTimeWaterMarkerAvroRecordTypeWithCommitTimeOrdering` to 
assert the new behavior (now tracks) and adds a negative test for the 
missing-event-time-field case.
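   
   To illustrate the per-partition fold described in the first bullet, here is 
a minimal, self-contained sketch. It does not use Hudi classes: `Stat`, 
`MinMax`, `rollup`, and `sample` are hypothetical stand-ins for 
`HoodieWriteStat`, `Pair<Option<Long>, Option<Long>>`, the new getter, and test 
data, and the actual implementation in this PR may differ in detail.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class PartitionEventTimeRollup {

  /** Hypothetical stand-in for the event-time fields carried by one write stat. */
  static final class Stat {
    final Long minEventTime; // null when the stat recorded no min event time
    final Long maxEventTime; // null when the stat recorded no max event time

    Stat(Long minEventTime, Long maxEventTime) {
      this.minEventTime = minEventTime;
      this.maxEventTime = maxEventTime;
    }
  }

  /** Hypothetical stand-in for Pair<Option<Long>, Option<Long>>. */
  static final class MinMax {
    final Optional<Long> min;
    final Optional<Long> max;

    MinMax(Long min, Long max) {
      this.min = Optional.ofNullable(min);
      this.max = Optional.ofNullable(max);
    }
  }

  /** Folds min/max per partition; partitions with no event time at all are omitted. */
  static Map<String, MinMax> rollup(Map<String, List<Stat>> partitionToWriteStats) {
    Map<String, MinMax> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<Stat>> entry : partitionToWriteStats.entrySet()) {
      Long min = null;
      Long max = null;
      for (Stat stat : entry.getValue()) {
        if (stat.minEventTime != null) {
          min = (min == null) ? stat.minEventTime : Long.valueOf(Math.min(min, stat.minEventTime));
        }
        if (stat.maxEventTime != null) {
          max = (max == null) ? stat.maxEventTime : Long.valueOf(Math.max(max, stat.maxEventTime));
        }
      }
      if (min != null || max != null) {
        result.put(entry.getKey(), new MinMax(min, max));
      }
    }
    return result;
  }

  /** Sample input: one fully populated partition, one without event time, one partial. */
  static Map<String, List<Stat>> sample() {
    Map<String, List<Stat>> stats = new LinkedHashMap<>();
    stats.put("dt=2026-05-18", Arrays.asList(new Stat(100L, 200L), new Stat(50L, 150L)));
    stats.put("dt=2026-05-19", Arrays.asList(new Stat(null, null)));
    stats.put("dt=2026-05-20", Arrays.asList(new Stat(null, 300L)));
    return stats;
  }

  public static void main(String[] args) {
    for (Map.Entry<String, MinMax> e : rollup(sample()).entrySet()) {
      System.out.println(e.getKey() + " min=" + e.getValue().min + " max=" + e.getValue().max);
    }
  }
}
```

   With the sample input, `dt=2026-05-19` is absent from the result because 
none of its stats carry an event time, while `dt=2026-05-20` is kept with an 
empty min and a present max.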
   
   Full hudi-common (1897 tests) and hudi-client-common (1026 tests) suites 
pass locally.
   
   ### Impact
   
   Public-API addition on `HoodieCommitMetadata`: external tools (catalogs, 
freshness exporters, lineage UIs) can now read per-partition freshness directly 
without walking write stats.
   
   Behavior change for opted-in tables: COW / `COMMIT_TIME_ORDERING` tables 
with `hoodie.write.track.event.time.watermark=true` and an event-time field 
will now populate min/max on write stats; previously, watermark tracking was 
silently a no-op for them. Tables that have *not* set the flag see no change.
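   
   The gating change can be sketched as a before/after predicate. This is an 
illustration only: `WatermarkGate`, `MergeMode`, `shouldTrackBefore`, and 
`shouldTrackAfter` are hypothetical names, and the "before" form is an 
assumption based on the description above, not a copy of the 
`HoodieWriteHandle` code.

```java
public class WatermarkGate {

  /** Hypothetical stand-in for the merge modes named in this PR. */
  enum MergeMode { COMMIT_TIME_ORDERING, EVENT_TIME_ORDERING }

  /** Assumed shape of the old gate: opt-in flag, event-time field, and EVENT_TIME_ORDERING. */
  static boolean shouldTrackBefore(String eventTimeFieldName, boolean trackWatermark, MergeMode mode) {
    return eventTimeFieldName != null && trackWatermark && mode == MergeMode.EVENT_TIME_ORDERING;
  }

  /** New gate as described above: the merge mode is no longer consulted. */
  static boolean shouldTrackAfter(String eventTimeFieldName, boolean trackWatermark) {
    return eventTimeFieldName != null && trackWatermark;
  }

  public static void main(String[] args) {
    // Opted-in COMMIT_TIME_ORDERING table with an event-time field ("ts" is a made-up name).
    System.out.println("before: " + shouldTrackBefore("ts", true, MergeMode.COMMIT_TIME_ORDERING));
    System.out.println("after:  " + shouldTrackAfter("ts", true));
    // Flag left at its default, or no event-time field: still not tracked.
    System.out.println("flag off: " + shouldTrackAfter("ts", false));
    System.out.println("no field: " + shouldTrackAfter(null, true));
  }
}
```

   Only the first case changes between the two predicates; a table without the 
flag or without an event-time field is not tracked under either.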
   
   No performance impact is expected: the rollup is a pure in-memory 
aggregation that callers invoke on demand, and watermark extraction at write 
time stays on the same per-record path on which it was already gated.
   
   ### Risk Level
   
   low
   
   The new method is additive. The behavior change is conditional on a config 
that is `false` by default and gated on an event-time field name; tables not 
using the flag are unaffected. Verified by running the full `hudi-common` and 
`hudi-client-common` test suites locally with no regressions.
   
   ### Documentation Update
   
   The `hoodie.write.track.event.time.watermark` config description should be 
updated on the Hudi website to reflect that it no longer requires 
`EVENT_TIME_ORDERING`. The new `getMinAndMaxEventTimePerPartition()` API is 
internally documented via Javadoc; a website page covering per-partition 
freshness consumption can land alongside Phase 2 (upstream propagation) so 
users see the end-to-end story in one place.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   

