nsivabalan commented on code in PR #13499:
URL: https://github.com/apache/hudi/pull/13499#discussion_r2279943017


##########
rfc/rfc-97/rfc-97.md:
##########
@@ -0,0 +1,184 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-97: Deprecate Hudi Payload Class Usage
+
+## Proposers
+
+*   Lin Liu
+
+## Approvers
+
+*   Ethan Guo
+*   Sivabalan Narayanan
+*   Vinoth Chandar
+
+## Status
+
+JIRA: HUDI-9560
+
+---
+
+# Motivation
+
+During reads, Hudi currently supports three distinct mechanisms to merge records at runtime:
+
+* **Via APIs provided by `HoodieRecordPayload`** – Legacy interface enabling users to plug in merge logic.
+* **Through pluggable merger implementations inheriting from `HoodieRecordMerger`** – Newer and more composable approach introduced to separate merge semantics from payload definitions.
+* **By configuring merge modes such as `COMMIT_TIME_ORDERING` or `EVENT_TIME_ORDERING`** – Recommended declarative approach for most standard use cases.
+
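For the declarative route, only configuration is needed. A minimal sketch of the relevant write/table configs (key names follow recent Hudi releases; verify them against the version you run):

```properties
# Illustrative declarative merge configuration (not normative for this RFC)
hoodie.record.merge.mode=EVENT_TIME_ORDERING
# ordering/event-time field compared under EVENT_TIME_ORDERING
hoodie.datasource.write.precombine.field=ts
```
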
+The `HoodieRecordPayload` abstraction was once necessary to encapsulate merge semantics, especially before Hudi had consistent event time handling, watermark metadata, and standard schema evolution support. However, over the years, the payload interface has become a limiting factor:
+
+* It’s tightly coupled with the write path, making it hard to optimize read/write independently.
+* Many of the behaviors (e.g., null handling, default values) are re-implemented inconsistently in various payloads.
+* It breaks composability: writers like HoodieStreamer, DeltaStreamer, and Spark SQL have different expectations about payload behavior.
+* It’s difficult to evolve and maintain, especially with increasing user needs (e.g., CDC ingestion, partial updates, deduplication).
+
+This RFC proposes to **deprecate the usage of `HoodieRecordPayload`**, encourage standard declarative merge modes, and move towards cleanly defined, testable `HoodieRecordMerger` implementations for custom logic. This aligns Hudi with modern lakehouse expectations and simplifies the ecosystem significantly.
+
+---
+
+# Requirements
+
+* **Declarative merge semantics** are enforced via `RecordMergeMode`, covering 90%+ of industry use cases.
+* **Partial update semantics** (null-handling, default-filling, etc.) are captured as part of merge mode behavior via `hoodie.write.partial.update.mode`.
+* **Custom merge logic**, when required, is implemented through `HoodieRecordMerger` instances and configured via table properties.
+* **Legacy payloads must still function**, especially for large installations using Hudi for multi-year tables (e.g., in fintech, retail, health tech).
+* **All writers (SQL, HoodieStreamer, Flink, Java client)** should migrate toward payload-less workflows, even if they need a transition layer.
+* **Minimal to no changes for readers** (Presto/Trino/Spark SQL) reading table version <9.
+
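To make the first requirement concrete, the two declarative modes reduce to a simple, testable winner-selection rule. The sketch below is plain Python, not Hudi's implementation; the `ts` field name is an illustrative stand-in for the configured ordering field.

```python
# Conceptual sketch of the two declarative RecordMergeModes (not Hudi code).
# COMMIT_TIME_ORDERING: the record from the later commit always wins.
# EVENT_TIME_ORDERING:  the record with the higher ordering/event-time field
#                       wins; on a tie, the later-arriving record wins.

def merge(mode, older, newer):
    """older/newer are dict records; `newer` arrived in a later commit."""
    if mode == "COMMIT_TIME_ORDERING":
        return newer
    if mode == "EVENT_TIME_ORDERING":
        return newer if newer["ts"] >= older["ts"] else older
    raise ValueError(f"unsupported merge mode: {mode}")

# A late-arriving update with an older event time:
stored   = {"id": 1, "ts": 100, "v": "current"}
incoming = {"id": 1, "ts": 90,  "v": "out-of-order"}

assert merge("COMMIT_TIME_ORDERING", stored, incoming) is incoming  # latest commit wins
assert merge("EVENT_TIME_ORDERING", stored, incoming) is stored     # higher event time wins
```

Anything beyond this pick-the-winner shape is where a custom `HoodieRecordMerger` would come in.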
+---
+
+## Payload and Writer Usages Callout
+
+Payload-based write paths today are highly fragmented:
+
+* **`MySqlDebeziumAvroPayload` / `PostgresDebeziumAvroPayload`** are often used with HoodieStreamer + Avro transformer. They assume CDC structure and extract metadata from nested fields. These aren’t portable to Spark SQL or the Java client directly.
+* **`ExpressionPayload`** is used only within the Spark SQL engine (e.g., `update(...) set ... where ...`). It doesn’t work in HoodieStreamer or bulk insert paths.
+* **Some payloads like `AWSDmsAvroPayload`** have table-specific logic for delete markers and are only functional with certain MoR writers.
+
+These inconsistencies lead to bugs, surprises during upgrades, and poor UX for new users. By eliminating the need for payloads, we can:
+
+* Decouple writers from the merge logic embedded in payloads.
+* Consolidate test coverage and semantics around well-defined `RecordMergeMode`s and `PartialUpdateMode`s.
+* Improve future features like lakehouse-wide CDC ingestion, Iceberg interoperability, and schema-less streaming.
+
+---
+
+## Partial Update Mode
+
+A new table property, `hoodie.table.partial.update.mode=<value>`, controls how missing columns are interpreted in a record. This enables flexible logic without writing a custom payload or merger.
+
+| Mode              | Description                                                     |
+| ----------------- | --------------------------------------------------------------- |
+| `KEEP_VALUES`     | (default) Use value from previous record if column is missing   |
+| `FILL_DEFAULTS`   | Fill missing columns with default values from Avro schema       |
+| `IGNORE_DEFAULTS` | Skip update if current record has schema default value          |
+| `IGNORE_MARKERS`  | Skip update if current record matches a configured marker value |
+
+This config supports:
+
+* Use cases like **Debezium/CDC**, where marker values signify unknown/unavailable fields.
+* **Sparse updates** from streaming systems like Kafka and Flink.
+* **Backward-compatible upserts** during schema evolution.
+
+This behavior is now decoupled from the merge mode and supports all ingestion sources uniformly.
+
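The column-level semantics of the four modes in the table above can be sketched as follows. This is plain Python, not Hudi's implementation; the function name, the `MISSING` sentinel, and the `default`/`marker` parameters are illustrative assumptions.

```python
# Conceptual sketch of the four partial-update modes (not Hudi code).
# Resolves a single column of an incoming record against the stored record.

MISSING = object()  # sentinel: column absent from the incoming record

def resolve_column(mode, old, new, default=None, marker=None):
    """Return the value the merged record would carry for one column."""
    if new is MISSING:
        if mode == "FILL_DEFAULTS":
            return default   # absent column -> Avro schema default
        return old           # KEEP_VALUES (and others): keep the stored value
    if mode == "IGNORE_DEFAULTS" and new == default:
        return old           # incoming value is merely the schema default
    if mode == "IGNORE_MARKERS" and new == marker:
        return old           # incoming value is a configured marker
    return new               # otherwise the incoming value wins

# Sparse update: the writer omitted the column entirely.
assert resolve_column("KEEP_VALUES", old=42, new=MISSING) == 42
assert resolve_column("FILL_DEFAULTS", old=42, new=MISSING, default=0) == 0
# Debezium-style marker meaning "value unavailable at the source":
assert resolve_column("IGNORE_MARKERS", old="paid", new="__unset__", marker="__unset__") == "paid"
```
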
+---
+
+# Payload Migration Table
+
+*(Expanded with context on industry usage and reasoning)*
+
+| Payload Class                    | Merge Mode + Partial Update Mode | Changes Proposed | Recommendations to User | Behavior / Notes |
+| -------------------------------- | -------------------------------- | ---------------- | ----------------------- | ---------------- |
+| `OverwriteWithLatestAvroPayload` | `COMMIT_TIME_ORDERING`           | Upgrade process sets the right merge mode and removes the legacy payload class from the table config. | No action | Most common for bulk ingest. Removing the payload makes delete marker support consistent across COW/MOR. |
+| `DefaultHoodieRecordPayload`     | `EVENT_TIME_ORDERING`            | Upgrade process sets the right merge mode and removes the payload class from the table config. Set `hoodie.write.enable.event.time.watermark.in.commit.metadata=true` to produce event time watermarks in commit metadata. | No action | Default since Hudi 0.5.0; behavior unchanged. |

Review Comment:
   yes, this is not a table property. So, users have to set it as a writer property if they expect to track this in commit metadata



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
