shangxinli opened a new issue, #17512:
URL: https://github.com/apache/hudi/issues/17512
### Feature Description
**Summary**
This feature introduces **commit-level, partition-scoped freshness
metadata** in Apache Hudi.
Freshness metrics (e.g., min/max source event time) are recorded as commit
metadata, allowing users and downstream systems to reason about **data
freshness independently of commit time**.
Currently, `_hoodie_commit_time` only reflects when data was written, not when
source data was produced. This is insufficient for freshness SLAs, debugging
pipeline delays, or explaining partition-level freshness differences.
This proposal adds a **storage-layer primitive** for freshness tracking that
is **engine-agnostic, backward-compatible, and opt-in**.
**Motivation**
Users need a reliable way to answer:
- How fresh is a specific partition?
- Which partitions are lagging and why?
- Is freshness degrading across derived tables?
Existing approaches rely on external systems or expensive scans and lack
atomicity with Hudi commits.
**Design Overview**
- Freshness is observed during write, not inferred later
- Freshness is aggregated per destination partition
- Freshness is persisted atomically with the commit
- Hudi itself does not scan source tables or require query engine changes
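As an illustration of per-partition aggregation during write, the sketch below folds a batch of records into one min/max event-time pair per destination partition. It is a minimal sketch, not Hudi code: `aggregate_freshness`, the record layout, and the field names `dt` and `ts` are assumptions made for this example.

```python
from datetime import datetime, timezone


def aggregate_freshness(records, partition_field, event_time_field):
    """Fold a write batch into {partition: (min_event_time, max_event_time)}.

    Hypothetical helper: records are plain dicts, not Hudi records.
    """
    freshness = {}
    for record in records:
        event_time = record.get(event_time_field)
        if event_time is None:
            # A record without an event time contributes nothing.
            continue
        partition = record[partition_field]
        if partition in freshness:
            lo, hi = freshness[partition]
            freshness[partition] = (min(lo, event_time), max(hi, event_time))
        else:
            freshness[partition] = (event_time, event_time)
    return freshness


batch = [
    {"dt": "2025-12-01", "ts": datetime(2025, 12, 1, 10, 28, tzinfo=timezone.utc)},
    {"dt": "2025-12-01", "ts": datetime(2025, 12, 1, 10, 31, tzinfo=timezone.utc)},
    {"dt": "2025-12-02", "ts": datetime(2025, 12, 2, 0, 5, tzinfo=timezone.utc)},
    {"dt": "2025-12-02", "ts": None},
]
result = aggregate_freshness(batch, "dt", "ts")
```

Because the aggregation is a single pass over records the writer already handles, it needs no additional scan of the source.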
**Freshness Signal Selection (Simplified Rule)**
Freshness metrics are selected using the **best available signal**, in
strict priority order (the first signal that is available wins):
- Propagated freshness metadata from upstream Hudi commits (when present)
- Min/max aggregation of a configured event-time column during write
- No freshness metadata if neither signal is available
This avoids heuristic inference and works uniformly for raw and derived
tables.
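The priority rule can be sketched as a small function. This is an illustrative sketch only: `select_freshness`, its arguments, and the signal labels it returns are hypothetical names, not part of any Hudi API.

```python
def select_freshness(upstream_freshness, event_time_freshness):
    """Pick the freshness signal in strict priority order.

    Both arguments are dicts of {partition: {"min_event_time", "max_event_time"}}
    or None/empty when that signal is unavailable (hypothetical shapes).
    Returns (signal_name, freshness); freshness is None when nothing is available.
    """
    if upstream_freshness:
        # 1. Propagated metadata from upstream Hudi commits takes precedence.
        return "propagated", upstream_freshness
    if event_time_freshness:
        # 2. Otherwise fall back to min/max of the configured event-time column.
        return "event_time_column", event_time_freshness
    # 3. Neither signal exists: write nothing rather than infer a value.
    return "none", None


upstream = {"dt=2025-12-01": {"min_event_time": "2025-12-01T10:28:00Z",
                              "max_event_time": "2025-12-01T10:31:00Z"}}
column = {"dt=2025-12-01": {"min_event_time": "2025-12-01T10:29:00Z",
                            "max_event_time": "2025-12-01T10:30:00Z"}}

chosen_signal, chosen = select_freshness(upstream, column)
fallback_signal, _ = select_freshness(None, column)
missing_signal, missing = select_freshness(None, None)
```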
**Commit Metadata Extension**
Freshness metrics are stored under `extraMetadata` in the commit:
```
{
  "extraMetadata": {
    "hoodie.source.freshness": {
      "partition=dt=2025-12-01": {
        "min_event_time": "2025-12-01T10:28:00Z",
        "max_event_time": "2025-12-01T10:31:00Z"
      }
    }
  }
}
```
- Metadata is optional and additive
- Immutable once committed
- Ignored safely by older readers
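For illustration, the following sketch serializes per-partition min/max values into the structure shown above. It assumes timestamps are rendered as ISO-8601 UTC strings with a `Z` suffix, as in the example; `build_extra_metadata` and `to_iso_utc` are hypothetical helpers, and how the writer would attach the result to a real commit is not shown.

```python
import json
from datetime import datetime, timezone

FRESHNESS_KEY = "hoodie.source.freshness"  # key name taken from the proposal


def to_iso_utc(ts):
    """Render a timezone-aware datetime as e.g. 2025-12-01T10:28:00Z."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_extra_metadata(freshness):
    """Convert {partition_key: (min_ts, max_ts)} into the proposed layout."""
    return {
        "extraMetadata": {
            FRESHNESS_KEY: {
                partition: {
                    "min_event_time": to_iso_utc(lo),
                    "max_event_time": to_iso_utc(hi),
                }
                for partition, (lo, hi) in sorted(freshness.items())
            }
        }
    }


payload = build_extra_metadata({
    "partition=dt=2025-12-01": (
        datetime(2025, 12, 1, 10, 28, tzinfo=timezone.utc),
        datetime(2025, 12, 1, 10, 31, tzinfo=timezone.utc),
    ),
})
print(json.dumps(payload, indent=2))
```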
### User Experience
**How users will use this feature**
This feature is **opt-in**. Existing pipelines continue to function
unchanged.
**Configuration Changes (Optional)**
```
hoodie.source.freshness.enable=true
hoodie.source.freshness.event.time.field=ts
```
If disabled, no freshness metrics are collected or written.
**Behavior by Use Case**
**Raw ingestion tables (e.g., Kafka → Hudi)**
- Freshness computed from min/max of the configured event-time column
**Derived / transformed tables**
- If upstream freshness metadata exists, it is propagated automatically
- No requirement to retain event-time columns in derived schemas
**Pipelines without temporal signals**
- No freshness metadata is written (explicitly missing, not inferred)
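The proposal does not spell out how freshness from several upstream commits is combined for a derived partition. One plausible reading, sketched below as an assumption, is to take the earliest `min_event_time` and the latest `max_event_time` across all upstream ranges that feed the destination partition. `merge_upstream_freshness` is a hypothetical helper, and the sketch relies on all timestamps sharing the fixed-width `YYYY-MM-DDTHH:MM:SSZ` format so that string comparison matches chronological order.

```python
def merge_upstream_freshness(upstream_ranges):
    """Combine upstream freshness ranges for one destination partition.

    upstream_ranges: list of {"min_event_time": str, "max_event_time": str}.
    Returns the merged range, or None when there is no upstream signal
    (freshness stays explicitly missing and is never inferred).
    Assumption: widest-range merge (min of mins, max of maxes).
    """
    if not upstream_ranges:
        return None
    return {
        "min_event_time": min(r["min_event_time"] for r in upstream_ranges),
        "max_event_time": max(r["max_event_time"] for r in upstream_ranges),
    }


merged = merge_upstream_freshness([
    {"min_event_time": "2025-12-01T10:28:00Z", "max_event_time": "2025-12-01T10:31:00Z"},
    {"min_event_time": "2025-12-01T09:55:00Z", "max_event_time": "2025-12-01T10:02:00Z"},
])
no_signal = merge_upstream_freshness([])
```

Since only the min/max pair is carried forward, the derived table does not need to keep the event-time column in its own schema.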
**How users access freshness information**
- Via Hudi timeline / commit metadata APIs
- Via external tooling or observability systems
- No new SQL functions or query engine changes included in this RFC
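As a sketch of what external tooling might do with this metadata, the example below parses a commit's metadata as plain JSON and computes a per-partition freshness lag against a reference time. It deliberately does not call any Hudi timeline API; obtaining the commit JSON is left out, and `freshness_lag_seconds` and `parse_utc` are hypothetical names.

```python
import json
from datetime import datetime, timezone


def parse_utc(text):
    """Parse timestamps of the form 2025-12-01T10:31:00Z as UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def freshness_lag_seconds(commit_json, now):
    """Return {partition: seconds between `now` and max_event_time}.

    Commits without freshness metadata yield an empty dict.
    """
    extra = json.loads(commit_json).get("extraMetadata", {})
    freshness = extra.get("hoodie.source.freshness") or {}
    return {
        partition: (now - parse_utc(entry["max_event_time"])).total_seconds()
        for partition, entry in freshness.items()
    }


commit_json = json.dumps({
    "extraMetadata": {
        "hoodie.source.freshness": {
            "partition=dt=2025-12-01": {
                "min_event_time": "2025-12-01T10:28:00Z",
                "max_event_time": "2025-12-01T10:31:00Z",
            }
        }
    }
})
reference_time = datetime(2025, 12, 1, 10, 41, tzinfo=timezone.utc)
lags = freshness_lag_seconds(commit_json, reference_time)
```

An observability system could compare these lag values against its own thresholds, which keeps SLA enforcement outside Hudi as the non-goals require.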
### Hudi RFC Requirements
**Non-Goals**
This RFC explicitly does **not**:
- Change table schemas
- Require SQL or planner changes
- Enforce freshness SLAs
- Infer missing timestamps
- Introduce per-record freshness tracking
**Backward Compatibility**
- Commit metadata extension is additive
- No file format or schema changes
- No behavioral changes when feature is disabled
The feature is therefore fully backward compatible.
**Alternatives Considered**
**Table columns for freshness**
Rejected due to schema impact and merge complexity.
**Query-engine computation**
Rejected due to non-determinism and engine coupling.
**External SLA systems**
Rejected due to lack of atomicity and replayability.
**Future Work (Out of Scope)**
- Standardized freshness schema
- Metadata table integration
- Optional SQL exposure
- Multi-table freshness lineage visualization
**Summary**
This proposal adds a **minimal, deterministic, and engine-independent**
mechanism for tracking data freshness in Apache Hudi by leveraging commit
metadata.
It provides meaningful freshness observability while preserving Hudi’s core
design principles: immutability, backward compatibility, and separation of
storage and compute concerns.