nsivabalan commented on code in PR #13499:
URL: https://github.com/apache/hudi/pull/13499#discussion_r2279943017
########## rfc/rfc-97/rfc-97.md: ##########

@@ -0,0 +1,184 @@

<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

# RFC-97: Deprecate Hudi Payload Class Usage

## Proposers

* Lin Liu

## Approvers

* Ethan Guo
* Sivabalan Narayanan
* Vinoth Chandar

## Status

JIRA: HUDI-9560

---

# Motivation

During reads, Hudi currently supports three distinct mechanisms to merge records at runtime:

* **Via APIs provided by `HoodieRecordPayload`** – the legacy interface enabling users to plug in merge logic.
* **Through pluggable merger implementations inheriting from `HoodieRecordMerger`** – a newer and more composable approach introduced to separate merge semantics from payload definitions.
* **By configuring merge modes such as `COMMIT_TIME_ORDERING` or `EVENT_TIME_ORDERING`** – the recommended declarative approach for most standard use cases.

The `HoodieRecordPayload` abstraction was once necessary to encapsulate merge semantics, especially before Hudi had consistent event time handling, watermark metadata, and standard schema evolution support. However, over the years, the payload interface has become a limiting factor:

* It is tightly coupled with the write path, making it hard to optimize reads and writes independently.
* Many of the behaviors (e.g., null handling, default values) are re-implemented inconsistently across payloads.
* It breaks composability — writers like HoodieStreamer, DeltaStreamer, and Spark SQL have different expectations about payload behavior.
* It is difficult to evolve and maintain, especially with growing user needs (e.g., CDC ingestion, partial updates, deduplication).

This RFC proposes to **deprecate the usage of `HoodieRecordPayload`**, encourage the standard declarative merge modes, and move custom logic towards cleanly defined, testable `HoodieRecordMerger` implementations. This aligns Hudi with modern lakehouse expectations and simplifies the ecosystem significantly.

---

# Requirements

* **Declarative merge semantics** are enforced via `RecordMergeMode`, which covers 90%+ of industry use cases.
* **Partial update semantics** (null handling, default filling, etc.) are captured as part of merge mode behavior via `hoodie.write.partial.update.mode`.
* **Custom merge logic**, when required, is implemented through `HoodieRecordMerger` instances and configured via table properties.
* **Legacy payloads must still function**, especially for large installations using Hudi for multi-year tables (e.g., in fintech, retail, and health tech).
* **All writers (SQL, HoodieStreamer, Flink, Java client)** should migrate toward payload-less workflows, even if they need a transition layer.
* **Minimal to no changes for readers** (Presto/Trino/Spark SQL) reading table versions < 9.

---

## Payload and Writer Usages Callout

Payload-based write paths today are highly fragmented:

* **`MySqlDebeziumAvroPayload` / `PostgresDebeziumAvroPayload`** are often used with HoodieStreamer plus an Avro transformer. They assume a CDC structure and extract metadata from nested fields. These are not directly portable to Spark SQL or the Java client.
* **`ExpressionPayload`** is used only within the Spark SQL engine (e.g., `update(...) set ... where ...`).
  It does not work in HoodieStreamer or bulk insert paths.
* **Some payloads like `AWSDmsAvroPayload`** have table-specific logic for delete markers and are only functional with certain MoR writers.

These inconsistencies lead to bugs, surprises during upgrades, and poor UX for new users. By eliminating the need for payloads, we can:

* Decouple writers from the tightly-coupled logic embedded in payloads.
* Consolidate test coverage and semantics around well-defined `RecordMergeMode`s and `PartialUpdateMode`s.
* Improve future features like lakehouse-wide CDC ingestion, Iceberg interoperability, and schema-less streaming.

---

## Partial Update Mode

The new table property `hoodie.table.partial.update.mode=<value>` controls how missing columns in a record are interpreted. This enables flexible logic without writing a custom payload or merger.

| Mode | Description |
| ----------------- | --------------------------------------------------------------- |
| `KEEP_VALUES` | (default) Use the value from the previous record if a column is missing |
| `FILL_DEFAULTS` | Fill missing columns with default values from the Avro schema |
| `IGNORE_DEFAULTS` | Skip the update if the current record has the schema default value |
| `IGNORE_MARKERS` | Skip the update if the current record matches a configured marker value |

This config supports:

* Use cases like **Debezium/CDC**, where marker values signify unknown or unavailable fields.
* **Sparse updates** from streaming systems like Kafka and Flink.
* **Backward-compatible upserts** during schema evolution.

This behavior is now decoupled from merge mode and supports all ingestion sources uniformly.
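The modes above govern what happens when a column is absent from an incoming record. As a rough illustration only (plain Java maps with hypothetical field names, not Hudi's actual `HoodieRecordMerger` API), the difference between `KEEP_VALUES` and `FILL_DEFAULTS` can be sketched as:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: models two partial-update modes with plain maps.
// A null value in `incoming` stands in for a column missing from the update.
public class PartialUpdateSketch {
    enum PartialUpdateMode { KEEP_VALUES, FILL_DEFAULTS }

    static Map<String, Object> merge(Map<String, Object> stored,
                                     Map<String, Object> incoming,
                                     Map<String, Object> schemaDefaults,
                                     PartialUpdateMode mode) {
        Map<String, Object> result = new HashMap<>();
        for (String col : stored.keySet()) {
            Object v = incoming.get(col);
            if (v != null) {
                result.put(col, v);                       // column present in the update
            } else if (mode == PartialUpdateMode.KEEP_VALUES) {
                result.put(col, stored.get(col));         // keep the previous value
            } else {
                result.put(col, schemaDefaults.get(col)); // fill the Avro schema default
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new HashMap<>();
        stored.put("id", 1); stored.put("city", "SFO"); stored.put("score", 10);
        Map<String, Object> incoming = new HashMap<>();
        incoming.put("id", 1); incoming.put("score", 20); // "city" is missing
        Map<String, Object> defaults = new HashMap<>();
        defaults.put("id", 0); defaults.put("city", "UNKNOWN"); defaults.put("score", 0);

        System.out.println(merge(stored, incoming, defaults, PartialUpdateMode.KEEP_VALUES).get("city"));   // SFO
        System.out.println(merge(stored, incoming, defaults, PartialUpdateMode.FILL_DEFAULTS).get("city")); // UNKNOWN
    }
}
```

`IGNORE_DEFAULTS` and `IGNORE_MARKERS` follow the same shape, except the decision is made by comparing the incoming value against the schema default or a configured marker value rather than against null.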

---

# Payload Migration Table

*(Expanded with context on industry usage and reasoning)*

| Payload Class | Merge Mode + Partial Update Mode | Changes Proposed | Recommendations to User | Behavior / Notes |
| ------------------------------------------- | ------------------------------------------ | ------------------------------------------- | ----------------------- | ---------------- |
| `OverwriteWithLatestAvroPayload` | `COMMIT_TIME_ORDERING` | Upgrade process sets the right merge mode and adds the legacy payload class to table config. | No action | Most common for bulk ingest. Removing the payload makes delete-marker support consistent across COW/MOR. |
| `DefaultHoodieRecordPayload` | `EVENT_TIME_ORDERING` | Upgrade process sets the right merge mode and removes the payload class from table config. Set `hoodie.write.enable.event.time.watermark.in.commit.metadata=true` to produce event time watermarks in commit metadata. | No action | Default since Hudi 0.5.0; behavior unchanged. |

Review Comment:
   Yes, this is not a table property. So users have to set it as a writer property if they expect it to be tracked in commit metadata.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
