[GitHub] [hudi] YannByron commented on pull request #5436: [RFC-51] [HUDI-3478] Change Data Capture RFC

GitBox Wed, 27 Apr 2022 02:21:57 -0700


YannByron commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1110771677


   > left some initial comments. I think the main decision here is whether or
   > not to reuse the existing record level commit metadata and build CDC on top or
   > do a separate `.cdc` folder? Can you clarify what exactly is contained in the
   > files under `.cdc`?
   
   Sorry for leaving some points unclear in the RFC doc. Let me explain them 
here, and I'll update the RFC later.
   
   1. For COW tables, query efficiency is the main focus. I definitely do not 
want to write out log files just to persist the CDC data. So if the CDC data 
has to be persisted, I prefer to double-write. But I will try to reuse the 
normal data files and reduce the extra workload. To answer the question above: 
the `.cdc` folder will keep the files that we have to write out.
   
   2. For MOR tables, we care about write efficiency. In my design, we don't 
have to write any additional data or files. When querying CDC for MOR, we need 
to merge the incremental data written in log files with the base files to 
determine which records are deleted, which are updated (for those, we also need 
to find the previous values), and which are inserted.
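
To illustrate the MOR read path in point 2, here is a rough sketch of the
classification step. It is not Hudi code: `classify_changes`, the in-memory
dict standing in for a base file, and the list of `(key, record)` pairs
standing in for log file entries are all hypothetical, and the `i`/`u`/`d`
labels are only an assumed encoding.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-ins: a record is a plain dict, a base file is a
# key -> record mapping, and a log file is an ordered list of
# (key, record) pairs where record=None means "delete this key".
Record = Dict[str, object]
Change = Tuple[str, str, Optional[Record], Optional[Record]]


def classify_changes(
    base: Dict[str, Record],
    log_entries: List[Tuple[str, Optional[Record]]],
) -> List[Change]:
    """Merge log entries over the base file and label each changed key.

    Returns (op, key, before, after) tuples, where op is "i" (insert),
    "u" (update) or "d" (delete).
    """
    # Collapse the log so only the latest entry per key survives.
    latest: Dict[str, Optional[Record]] = {}
    for key, record in log_entries:
        latest[key] = record

    changes: List[Change] = []
    for key, after in latest.items():
        before = base.get(key)
        if after is None:
            # A delete only matters if the key existed in the base file.
            if before is not None:
                changes.append(("d", key, before, None))
        elif before is None:
            changes.append(("i", key, None, after))
        elif before != after:
            # For updates, the previous value comes from the base file.
            changes.append(("u", key, before, after))
    return changes


if __name__ == "__main__":
    base_file = {"k1": {"v": 1}, "k2": {"v": 2}}
    log_file = [("k1", {"v": 10}), ("k2", None), ("k3", {"v": 3})]
    for change in classify_changes(base_file, log_file):
        print(change)
```

Nothing extra is written at commit time in this sketch; all of the work
happens when the CDC query runs, which is the trade-off described above.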
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

