YannByron commented on code in PR #5885:
URL: https://github.com/apache/hudi/pull/5885#discussion_r923045845


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##########
@@ -281,7 +313,18 @@ private boolean writeUpdateRecord(HoodieRecord<T> hoodieRecord, GenericRecord ol
         return false;
       }
     }
-    return writeRecord(hoodieRecord, indexedRecord, isDelete);
+    boolean result = writeRecord(hoodieRecord, indexedRecord, isDelete);
+    if (cdcEnabled) {
+      if (indexedRecord.isPresent()) {
+        GenericRecord record = (GenericRecord) indexedRecord.get();
+      cdcData.add(cdcRecord(CDCOperationEnum.UPDATE, hoodieRecord.getRecordKey(), hoodieRecord.getPartitionPath(),

Review Comment:
   IMO, it's OK.
   A base parquet file is about 128M in most common cases. Even if all the records are updated, `cdcData` will take less than roughly 300M of memory. And if the workload is heavy, the user can increase the memory of the workers.
   But if we are worried about this, we can use `hudi.common.util.collection.ExternalSpillableMap` instead.
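   To illustrate the spill-to-disk idea behind that suggestion, here is a minimal, hypothetical sketch (it is NOT Hudi's actual `ExternalSpillableMap` API, just a simplified stand-in): keep at most a fixed number of CDC entries in a heap buffer and append any overflow to a temp file, so memory stays bounded even if every record in the base file is updated.

   ```java
   import java.io.BufferedWriter;
   import java.io.Closeable;
   import java.io.File;
   import java.io.FileWriter;
   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical sketch only: a buffer that holds up to `maxInMemory`
   // entries on the heap and spills the rest to a temp file on disk.
   public class SpillableCdcBuffer implements Closeable {
     private final int maxInMemory;
     private final List<String> inMemory = new ArrayList<>();
     private final File spillFile;
     private BufferedWriter spillWriter; // lazily opened on first spill
     private int spilledCount = 0;

     public SpillableCdcBuffer(int maxInMemory) throws IOException {
       this.maxInMemory = maxInMemory;
       this.spillFile = File.createTempFile("cdc-spill", ".log");
       this.spillFile.deleteOnExit();
     }

     public void add(String cdcRecord) throws IOException {
       if (inMemory.size() < maxInMemory) {
         inMemory.add(cdcRecord); // still under the in-memory budget
       } else {
         if (spillWriter == null) {
           spillWriter = new BufferedWriter(new FileWriter(spillFile));
         }
         spillWriter.write(cdcRecord); // over budget: append to disk
         spillWriter.newLine();
         spilledCount++;
       }
     }

     public int inMemorySize() {
       return inMemory.size();
     }

     public int spilledSize() {
       return spilledCount;
     }

     @Override
     public void close() throws IOException {
       if (spillWriter != null) {
         spillWriter.close();
       }
     }
   }
   ```

   The real `ExternalSpillableMap` is more sophisticated (size estimation, configurable disk map types), but the memory-bounding principle is the same.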



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
