(hudi) branch asf-site updated: [DOCS] Added blog for record mergers in hudi (#12904)

bhavanisudha Tue, 04 Mar 2025 03:45:18 -0800

This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 2c802a31224 [DOCS] Added blog for  record mergers in hudi  (#12904)
2c802a31224 is described below

commit 2c802a312242ebe64a83c9a20b8e3b8a32e502ef
Author: Aditya Goenka <[email protected]>
AuthorDate: Tue Mar 4 17:13:47 2025 +0530

    [DOCS] Added blog for  record mergers in hudi  (#12904)
    
    * Added record merger blog
    
    * updated image
---
 .../2025-03-03-record-mergers-in-apache-hudi.png   | Bin 0 -> 163237 bytes
 website/blog/2025-03-03-record-mergers-in-hudi.mdx | 164 +++++++++++++++++++++
 2 files changed, 164 insertions(+)

diff --git 
a/content/assets/images/blog/2025-03-03-record-mergers-in-apache-hudi.png 
b/content/assets/images/blog/2025-03-03-record-mergers-in-apache-hudi.png
new file mode 100644
index 00000000000..162622904ee
Binary files /dev/null and 
b/content/assets/images/blog/2025-03-03-record-mergers-in-apache-hudi.png differ
diff --git a/website/blog/2025-03-03-record-mergers-in-hudi.mdx 
b/website/blog/2025-03-03-record-mergers-in-hudi.mdx
new file mode 100644
index 00000000000..6ee14571eed
--- /dev/null
+++ b/website/blog/2025-03-03-record-mergers-in-hudi.mdx
@@ -0,0 +1,164 @@
+---
+title: "Record Mergers in Apache Hudi"
+excerpt: "Explain need for record mergers in Apache Hudi and implemenation 
details"
+author: Aditya Goenka
+category: blog
+image: /assets/images/blog/2025-03-03-record-mergers-in-apache-hudi.png
+tags:
+- Data Lake
+- Data Lakehouse
+- Apache Hudi
+- Record Mergers
+- Record payloads
+- Late Arriving Data
+---
+
+## The Challenge of Unordered Streams of Events
+One of the primary challenges associated with streaming workloads is the 
unordered nature of incoming events. In a typical streaming scenario, events 
can arrive out of sequence due to network latency, processing delays, or other 
factors. With the increasing volume and velocity of data being ingested from 
various sources—especially in mobile applications and IoT platforms—data 
processing frameworks must be equipped to handle mutations (i.e., changes to 
records) and out-of-order events.
+Traditional data storage systems and file formats, such as those optimized for 
batch processing, often struggle to manage these scenarios effectively. Hudi 
steps in with features specifically designed to handle such challenges.
+When events or changes to a record arrive at different times, they may not be 
in the same order in which they were originally generated. For example, in a 
smart city traffic monitoring system, sensors may report vehicle speeds at 
various intersections in real-time. However, due to network issues or delays, 
some sensor data might arrive later than others, possibly out of order. To 
handle this, the system needs to merge the new incoming data with existing 
records efficiently. Just like how [...]
+This can lead to several issues:
+#### Data Integrity
+When events are processed out of order, it can result in incorrect or 
inconsistent data states. For example, if an event representing a transaction 
is processed before the event that indicates the account balance, the resulting 
data may not accurately reflect the true state of the system.
+#### Complexity in Processing
+Handling unordered events often requires additional logic to ensure that data 
is processed in the correct sequence. This can complicate the data pipeline and 
increase the likelihood of errors.
+
+## What are Record Mergers
+With the new api introduced with version 1.0.0, Hudi supports three primary 
merge modes, each suited to different stages of data processing: writing, 
compaction, and querying.
+4 places/points of data processing [Subheader]
+#### 1. Merging input data before writing : Combining Change Records During 
Writes
+When new data arrives for an existing record, Hudi performs deduplication on 
the input dataset. This process involves combining multiple change records for 
the same record key before the write phase. This is an optimization that also 
helps reduce the number of records written to the log files (in case of MOR). 
By merging changes upfront, Hudi reduces unnecessary records, improving the 
efficiency of both query and write operations.
+This step is crucial for handling stream data in real-time, where changes may 
arrive rapidly, and ensuring that only the final version of the record is 
written into the system. Normally these out of order events come together 
commonly in the same batch,  With processing engines like spark, which deals 
with micro-batches, merging the input changes helps in reduces the number of 
records which needs to be written.
+#### 2. Merging Final Change Record in CoW (Copy-on-Write) Tables: Applying 
Changes to Existing Records
+In Copy-on-Write (CoW) tables, changes are applied by creating new file 
versions for the records. When an update, partial update, or delete operation 
occurs, Hudi will merge this final change with the existing record in the 
storage. The merge mode controls how these updates are applied, ensuring that 
only the most recent changes are reflected and the table’s data remains 
consistent.
+This is especially important in CoW tables, as they preserve immutability of 
historical data by writing new versions of the records instead of overwriting 
the existing data. The merge mode ensures that the new version of the record is 
consistent with all previous changes.
+#### 3. Compaction Merge in MoR (Merge-on-Read) Tables : Merging Log Files 
with Base Files
+Hudi uses a concept of log files (delta logs) and base files (original data). 
As changes to records accumulate over time, Hudi’s compaction service merges 
the change records stored in the log files with the base files to keep the data 
consistent and query-optimized. The merge mode defines how these log records 
are merged with base files during the compaction process.
+Compaction helps maintain storage efficiency and ensures that queries run 
faster by reducing the number of small log files that might need to be read.
+#### 4. Query-Time Merge: Merging Log Files with Base Files in MoR 
(Merge-on-Read) Tables
+In Merge-on-Read (MoR) tables, the data is stored in both log files and base 
files. When a query is executed, Hudi merges the change records in the log 
files with the base files based on the merge mode. The merge operation occurs 
at query time to provide the final, consistent view of the data.
+By merging records at query time, Hudi ensures that queries reflect the most 
recent changes while maintaining query performance.
+
+
+## Implementation
+In common scenarios, the input data contains a field that can be used to 
identify the latest record. Typically, tables have fields like updated_at or 
other ordering columns. If no such column is present in the input, we are 
limited to relying on the incoming order.
+
+After the release of Hudi 1.0.0, a new configuration, 
[hoodie.record.merge.mode](https://hudi.apache.org/docs/configurations/#hoodierecordmergemode)
 was introduced to define the merge modes responsible for handling record 
updates. These merge modes dictate how records with the same key are processed 
at different stages of the pipeline, from data ingestion to query results.
+It can have the following three values:
+
+#### 1. COMMIT_TIME_ORDERING
+This merge mode is used when no field is available in the input data to 
explicitly determine which record is the latest. The system will rely on the 
order of ingestion (commit time) to determine the order of records. Hudi 
expects records to arrive in strict order of their commits. So, the most recent 
record (in terms of ingestion time) is assumed to be the latest version of the 
record. This mode is typically used when there is no dedicated column like 
updated_at, timestamp, or versioning [...]
+The merging logic here simply picks the latest write based on the ingestion 
order (commit time). In a way, it's equivalent to overwriting semantics where 
only the most recent record is considered.
+Example -
+```sql
+SET hoodie.spark.sql.insert.into.operation=upsert;
+CREATE TABLE hudi_table (
+    ts BIGINT,
+    uuid STRING,
+    rider STRING,
+    driver STRING,
+    fare DOUBLE,
+    city STRING
+) USING HUDI TBLPROPERTIES (primaryKey = 'uuid', 
hoodie.record.merge.mode='COMMIT_TIME_ORDERING');
+
+INSERT INTO hudi_table
+VALUES
+(3,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
+(2,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-C','driver-M',27.70 
,'san_francisco');
+
+select * from hudi_table;
+-- Result - 20250106162911278  20250106162911278_0_0   
334e26e9-8355-45cc-97c6-c31daf0df330            
08218473-f72a-480d-90e6-c6764f062e5c-0_0-43-47_20250106162911278.parquet        
1695091554788   334e26e9-8355-45cc-97c6-c31daf0df330    rider-C driver-M        
27.7    san_francisco
+
+INSERT INTO hudi_table
+VALUES
+(1,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-D','driver-K',19.10,'san_francisco');
+
+select * from hudi_table;
+-- Result - 20250106163449812  20250106163449812_0_0   
334e26e9-8355-45cc-97c6-c31daf0df330            
08218473-f72a-480d-90e6-c6764f062e5c-0_0-71-68_20250106163449812.parquet        
1       334e26e9-8355-45cc-97c6-c31daf0df330    rider-D driver-K        19.1    
san_francisco
+```
+
+In the example above, we created the table using the COMMIT_TIME_ORDERING 
merge mode. When using this mode, there is no need to specify a precombine or 
ordering field.
+During the first insert, two records with the same record key are provided. 
The system will deduplicate them and keep the record that is processed later.
+In the second insert, a new record with the same record key is inserted. As 
expected, the table is updated with the new record because it is committed 
later, regardless of the values in any of the fields.
+
+#### 2. EVENT_TIME_ORDERING (DEFAULT)
+This merge mode is used when you do have a field in the input data that can be 
used to determine the order of events (such as a timestamp field like 
updated_at or a version number). If your records contain a field that can be 
used to track when the record was last updated (e.g., updated_at, 
last_modified, or a sequence number), Hudi will use this field to determine 
which record is the latest.
+In this case, Hudi does not rely on the ingestion order but instead uses the 
value of the ordering field (updated_at, for example) to decide the correct 
record.
+This approach is ideal when you have temporal or event-driven data, and you 
want to maintain the "latest" record according to an event timestamp.
+Example -
+```sql
+DROP TABLE hudi_table;
+SET hoodie.spark.sql.insert.into.operation=upsert;
+
+CREATE TABLE hudi_table (
+    ts BIGINT,
+    uuid STRING,
+    rider STRING,
+    driver STRING,
+    fare DOUBLE,
+    city STRING
+) USING HUDI TBLPROPERTIES (primaryKey = 'uuid',preCombineField = 'ts', 
hoodie.record.merge.mode='EVENT_TIME_ORDERING');
+
+INSERT INTO hudi_table
+VALUES
+(3,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
+(2,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-C','driver-M',27.70 
,'san_francisco');
+
+select * from hudi_table;
+-- Result - 20250106165902806  20250106165902806_0_0   
334e26e9-8355-45cc-97c6-c31daf0df330            
568ce7bc-9b71-4e15-b557-cbaeb5b4d2ea-0_0-56-57_20250106165902806.parquet        
3       334e26e9-8355-45cc-97c6-c31daf0df330    rider-A driver-K        19.1    
san_francisco
+
+INSERT INTO hudi_table
+VALUES
+(1,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-D','driver-K',18.00,'san_francisco');
+
+select * from hudi_table;
+-- Result - 20250106165902806  20250106165902806_0_0   
334e26e9-8355-45cc-97c6-c31daf0df330            
568ce7bc-9b71-4e15-b557-cbaeb5b4d2ea-0_0-84-78_20250106165918731.parquet        
3       334e26e9-8355-45cc-97c6-c31daf0df330    rider-A driver-K        19.1    
san_francisco
+```
+
+In the example above, we created the table using the EVENT_TIME_ORDERING merge 
mode. When using this mode, we need to specify the precombineField. In this 
case we are specifying ts as the precombineField.
+During the first insert, two records with the same record key are provided. 
The system will deduplicate them and keep the record that is processed later.
+In the second insert, a new record with the same record key is inserted. As 
expected, the table is updated with the new record because it is committed 
later, regardless of the values in any of the fields.
+
+#### 3. CUSTOM
+For more complex use-case sometimes prior discussed merging modes won’t work. 
We may need to implement a use-case specific merging logic.
+The details for the implementation is provided here  - 
https://hudi.apache.org/docs/record_merger/#custom
+
+## Record Payloads
+Pre 1.0.0, Hudi uses the legacy Record Payload API, Please refer to the 
[Record Payloads](https://hudi.apache.org/docs/record_merger/#record-payloads) 
section to know about the implementation and some of the existing record 
payloads.
+
+Along with the existing payloads, Hudi provides flexibility to implement the 
custom record payload by implementing the 
[HoodieRecordPayload](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java)
 interface
+
+The following example demonstrates the use of Record Payload, which achieves a 
similar outcome to what EVENT_TIME_ORDERING does. We’ve used the same example 
as above to illustrate how this functionality works.
+
+```sql
+DROP TABLE hudi_table;
+SET hoodie.spark.sql.insert.into.operation=upsert;
+
+CREATE TABLE hudi_table (
+    ts BIGINT,
+    uuid STRING,
+    rider STRING,
+    driver STRING,
+    fare DOUBLE,
+    city STRING
+) USING HUDI TBLPROPERTIES (primaryKey = 'uuid',preCombineField = 'ts', 
hoodie.datasource.write.payload.class='org.apache.hudi.common.model.DefaultHoodieRecordPayload');
+
+INSERT INTO hudi_table
+VALUES
+(3,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
+(2,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-C','driver-M',27.70 
,'san_francisco');
+
+select * from hudi_table;
+-- Result - 20250203164444124  20250203164444124_0_0   
334e26e9-8355-45cc-97c6-c31daf0df330            
4549ed8e-0346-4d59-8878-9e047fb6c651-0_0-14-17_20250203164444124.parquet        
3       334e26e9-8355-45cc-97c6-c31daf0df330    rider-A driver-K        19.1    
san_francisco
+
+
+INSERT INTO hudi_table
+VALUES
+(1,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-D','driver-K',18.00,'san_francisco');
+
+select * from hudi_table;
+-- Result - 20250203164444124  20250203164444124_0_0   
334e26e9-8355-45cc-97c6-c31daf0df330            
4549ed8e-0346-4d59-8878-9e047fb6c651-0_0-53-51_20250203164537068.parquet        
3       334e26e9-8355-45cc-97c6-c31daf0df330    rider-A driver-K        19.1    
san_francisco
+```
+
+## Conclusion
+In conclusion, managing late-arriving and out-of-order data is a critical 
challenge in modern data processing systems, especially when dealing with 
large-scale, real-time data pipelines. Tools like Hudi provide powerful merge 
modes that ensure data consistency, accuracy, and efficiency by handling record 
updates intelligently across different stages of the pipeline. Whether you're 
working with streaming data, IoT sensors, or social media posts, understanding 
how to configure and use thes [...]

(hudi) branch asf-site updated: [DOCS] Added blog for record mergers in hudi (#12904)

Reply via email to