shangxinli opened a new issue, #17512:
URL: https://github.com/apache/hudi/issues/17512

   ### Feature Description
   
   **Summary**
   
   This feature introduces **commit-level, partition-scoped freshness 
metadata** in Apache Hudi.
   Freshness metrics (e.g., min/max source event time) are recorded as commit 
metadata, allowing users and downstream systems to reason about **data 
freshness independently of commit time**.
   
   Currently, `_hoodie_commit_time` only reflects when data was written, not when 
source data was produced. This is insufficient for freshness SLAs, debugging 
pipeline delays, or explaining partition-level freshness differences.
   
   This proposal adds a **storage-layer primitive** for freshness tracking that 
is **engine-agnostic, backward-compatible, and opt-in**.
   
   **Motivation**
   
   Users need a reliable way to answer:
   
   - How fresh is a specific partition?
   - Which partitions are lagging and why?
   - Is freshness degrading across derived tables?
   
   Existing approaches rely on external systems or expensive scans and lack 
atomicity with Hudi commits.
   
   **Design Overview**
   
   - Freshness is observed during write, not inferred later
   - Freshness is aggregated per destination partition
   - Freshness is persisted atomically with the commit
   - Hudi itself does not scan source tables or require query engine changes
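
As an illustration of "observed during write, aggregated per destination partition", here is a minimal sketch in plain Python. It is not Hudi writer code: `aggregate_freshness`, the dict-shaped records, and the field names `dt` and `ts` are hypothetical stand-ins for whatever the write path actually sees.

```python
from datetime import datetime, timezone


def aggregate_freshness(records, partition_field, event_time_field):
    """Fold a batch of records into one min/max event-time window per partition.

    Hypothetical helper: records are plain dicts, event times are datetimes.
    """
    windows = {}
    for record in records:
        event_time = record.get(event_time_field)
        if event_time is None:
            continue  # a record without an event time contributes nothing
        partition = record[partition_field]
        window = windows.get(partition)
        if window is None:
            windows[partition] = {
                "min_event_time": event_time,
                "max_event_time": event_time,
            }
        else:
            window["min_event_time"] = min(window["min_event_time"], event_time)
            window["max_event_time"] = max(window["max_event_time"], event_time)
    return windows


batch = [
    {"dt": "2025-12-01", "ts": datetime(2025, 12, 1, 10, 31, tzinfo=timezone.utc)},
    {"dt": "2025-12-01", "ts": datetime(2025, 12, 1, 10, 28, tzinfo=timezone.utc)},
]
windows = aggregate_freshness(batch, "dt", "ts")
```

The aggregation is a single pass over records already in hand, which is why no scan of the source table is needed.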
   
   **Freshness Signal Selection (Simplified Rule)**
   
   Freshness metrics are selected using the **best available signal**, in 
strict order:
   
   - Propagated freshness metadata from upstream Hudi commits (when present)
   - Min/max aggregation of a configured event-time column during write
   - No freshness metadata if neither signal is available
   
   This avoids heuristic inference and works uniformly for raw and derived 
tables.
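
The strict ordering can be sketched as a small function. This is only an illustration of the rule as stated; `select_freshness_signal` and the signal labels it returns are hypothetical names, not part of any Hudi API.

```python
def select_freshness_signal(upstream_windows, computed_windows):
    """Return (signal_name, windows) using the first available signal.

    Order: propagated upstream metadata, then min/max of the configured
    event-time column, then nothing. Empty or None inputs count as absent.
    """
    if upstream_windows:
        return "propagated", upstream_windows
    if computed_windows:
        return "event_time_column", computed_windows
    return None, None  # explicitly missing, never inferred
```

Returning `(None, None)` for the last case keeps "no signal" distinguishable from an empty window, so the writer can skip the metadata entry entirely.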
   
   **Commit Metadata Extension**
   
   Freshness metrics are stored under `extraMetadata` in the commit:
   
   ```
   {
     "extraMetadata": {
       "hoodie.source.freshness": {
         "partition=dt=2025-12-01": {
           "min_event_time": "2025-12-01T10:28:00Z",
           "max_event_time": "2025-12-01T10:31:00Z"
         }
       }
     }
   }
   ```
   
   - Metadata is optional and additive
   - Immutable once committed
   - Ignored safely by older readers
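
To make the layout concrete, the following sketch builds the structure shown above from per-partition windows. It follows the nested shape in the example as written; whether the inner object is stored as-is or as a serialized string is not specified here. `build_extra_metadata` and `to_iso_utc` are hypothetical helpers, and the datetimes are assumed to be timezone-aware.

```python
import json
from datetime import datetime, timezone

FRESHNESS_KEY = "hoodie.source.freshness"  # key taken from the example above


def to_iso_utc(moment):
    """Format a timezone-aware datetime as e.g. 2025-12-01T10:28:00Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_extra_metadata(windows):
    """Wrap {partition: {min_event_time, max_event_time}} in the commit layout."""
    payload = {
        partition: {
            "min_event_time": to_iso_utc(window["min_event_time"]),
            "max_event_time": to_iso_utc(window["max_event_time"]),
        }
        for partition, window in sorted(windows.items())
    }
    return {"extraMetadata": {FRESHNESS_KEY: payload}}


example = build_extra_metadata({
    "partition=dt=2025-12-01": {
        "min_event_time": datetime(2025, 12, 1, 10, 28, tzinfo=timezone.utc),
        "max_event_time": datetime(2025, 12, 1, 10, 31, tzinfo=timezone.utc),
    }
})
print(json.dumps(example, indent=2))
```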
   
   ### User Experience
   
   **How users will use this feature**
   
   This feature is **opt-in**. Existing pipelines continue to function 
unchanged.
   
   **Configuration Changes (Optional)**
   ```
   hoodie.source.freshness.enable=true
   hoodie.source.freshness.event.time.field=ts
   ```
   
   If disabled, no freshness metrics are collected or written.
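
A sketch of how a writer might interpret these two proposed keys, assuming a properties-style text input and a default of disabled (consistent with the opt-in behavior described here). `parse_freshness_config` is a hypothetical helper, not an existing Hudi function.

```python
ENABLE_KEY = "hoodie.source.freshness.enable"
FIELD_KEY = "hoodie.source.freshness.event.time.field"


def parse_freshness_config(properties_text):
    """Return (enabled, event_time_field) from key=value lines.

    The feature is treated as disabled unless explicitly enabled, and the
    event-time field is ignored while the feature is disabled.
    """
    props = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    enabled = props.get(ENABLE_KEY, "false").lower() == "true"
    field = props.get(FIELD_KEY) if enabled else None
    return enabled, field
```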
   
   **Behavior by Use Case**
   
   **Raw ingestion tables (e.g., Kafka → Hudi)**
   
   - Freshness computed from min/max of the configured event-time column
   
   **Derived / transformed tables**
   
   - If upstream freshness metadata exists, it is propagated automatically
   - No requirement to retain event-time columns in derived schemas
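
One plausible way to propagate freshness into a derived table is to merge the upstream windows that feed a destination partition: the earliest minimum and the latest maximum. The proposal does not specify how upstream partitions map to destination partitions, so the sketch below simply assumes the caller has already collected the relevant upstream windows. `merge_upstream_windows` and `parse_utc` are hypothetical names.

```python
from datetime import datetime


def parse_utc(timestamp):
    """Parse a timestamp such as 2025-12-01T10:28:00Z into an aware datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def merge_upstream_windows(upstream_windows):
    """Combine upstream windows into one: earliest min, latest max.

    Returns None when there is nothing to propagate.
    """
    if not upstream_windows:
        return None
    earliest = min(upstream_windows, key=lambda w: parse_utc(w["min_event_time"]))
    latest = max(upstream_windows, key=lambda w: parse_utc(w["max_event_time"]))
    return {
        "min_event_time": earliest["min_event_time"],
        "max_event_time": latest["max_event_time"],
    }
```

Because the merge works only on upstream metadata, the derived table never needs the event-time column in its own schema.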
   
   **Pipelines without temporal signals**
   
   - No freshness metadata is written (explicitly missing, not inferred)
   
   **How users access freshness information**
   
   - Via Hudi timeline / commit metadata APIs
   - Via external tooling or observability systems
   - No new SQL functions or query engine changes included in this RFC
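
For external tooling, a consumer might read the commit metadata as JSON and flag partitions whose newest event time is too old. The sketch below assumes the commit JSON has already been obtained (how it is fetched from the timeline is out of scope here) and that it uses the nested layout from the example earlier in this proposal. `lagging_partitions` is a hypothetical helper.

```python
import json
from datetime import datetime, timedelta, timezone


def lagging_partitions(commit_json, now, max_lag):
    """Return {partition: lag} for partitions whose max_event_time is older
    than now - max_lag. Commits without freshness metadata yield {}."""
    metadata = json.loads(commit_json)
    freshness = metadata.get("extraMetadata", {}).get("hoodie.source.freshness", {})
    lagging = {}
    for partition, window in freshness.items():
        newest = datetime.fromisoformat(window["max_event_time"].replace("Z", "+00:00"))
        lag = now - newest
        if lag > max_lag:
            lagging[partition] = lag
    return lagging
```

Passing `now` explicitly keeps the check deterministic, which makes it usable both for live alerting and for replaying historical commits.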
   
   ### Hudi RFC Requirements
   
   **Non-Goals**
   
   This RFC explicitly does **not**:
   
   - Change table schemas
   - Require SQL or planner changes
   - Enforce freshness SLAs
   - Infer missing timestamps
   - Introduce per-record freshness tracking
   
   **Backward Compatibility**
   
   - Commit metadata extension is additive
   - No file format or schema changes
   - No behavioral changes when feature is disabled
   
   Fully backward compatible.
   
   **Alternatives Considered**
   
   **Table columns for freshness**
   
   Rejected due to schema impact and merge complexity.
   
   **Query-engine computation**
   
   Rejected due to non-determinism and engine coupling.
   
   **External SLA systems**
   
   Rejected due to lack of atomicity and replayability.
   
   **Future Work (Out of Scope)**
   
   - Standardized freshness schema
   - Metadata table integration
   - Optional SQL exposure
   - Multi-table freshness lineage visualization
   
   **Summary**
   
   This proposal adds a **minimal, deterministic, and engine-independent** 
mechanism for tracking data freshness in Apache Hudi by leveraging commit 
metadata.
   
   It provides meaningful freshness observability while preserving Hudi’s core 
design principles: immutability, backward compatibility, and separation of 
storage and compute concerns.

