danny0405 commented on PR #5436: URL: https://github.com/apache/hudi/pull/5436#issuecomment-1110950019
> > Left some initial comments. I think the main decision here is whether to reuse the existing record-level commit metadata and build CDC on top of it, or to use a separate `.cdc` folder. Can you clarify what exactly is contained in the files under `.cdc`?
>
> Sorry for leaving some points unclear in this RFC doc. Let me address them here, and I'll update the RFC later.
>
> 1. For COW tables, query efficiency is the main focus. I definitely do not want to write out log files; if I have to persist the CDC data, I prefer to double-write, while trying to reuse the normal data files and reduce the extra workload. To answer the question above: the `.cdc` folder keeps the files that we have to write out.
> 2. For MOR tables, we care about write efficiency. In my design I try to avoid writing any extra data or files. But in some cases, for example when `HoodieMergeHandle` executes the write, a base file is produced rather than a log file, so I have to write a CDC file to record the change. And when querying CDC for MOR, we need to merge the incremental data written in log files and base files to determine which records were deleted, which were updated (for those, we also need to find the previous values), and which were inserted.

In general, the first-priority design guideline is not to double-write, for these reasons:

1. The detailed CDC records occupy several times the storage of the actual base data files. This is not acceptable in production; especially for a lake format, we already have active timeline commits for history snapshots.
2. Double-writing would obviously reduce write throughput.
3. If we double-write the log files, we need to handle the transaction spanning the data files and the CDC logs. For example, what if the log write succeeds but the data files fail? Should we fail over, and how? Recover from the log files? There are many corner cases to handle, just as we already did for the metadata table.
4. What about the TTL of the log files? Should it be managed separately from the data files? Say we keep the 10 latest commits for data files, should we also keep that many for log files? How do we clean them, and which component cleans them? The existing cleaning service? Note that the CDC log data set is huge, so the cleaning must be efficient enough.
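The MOR read path described in point 2 above — merging incremental data against the previous snapshot to classify records as inserted, updated (with previous values), or deleted — can be sketched roughly as follows. This is a minimal illustration, not Hudi's actual API: the class, record type, and the flat `Map<String, String>` stand-ins for a base-file snapshot and the incoming commit's records are all hypothetical, with a `null` value used as a delete marker.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not Hudi's API): infer CDC rows for one file group by
// diffing the previous base-file snapshot against the records of a new commit.
public class CdcMergeSketch {

    enum Op { INSERT, UPDATE, DELETE }

    // A minimal CDC row: operation, record key, before image, after image.
    record CdcRow(Op op, String key, String before, String after) {}

    // base: record key -> value as of the previous snapshot.
    // incoming: record key -> new value; a null value marks a delete.
    static List<CdcRow> inferCdc(Map<String, String> base, Map<String, String> incoming) {
        List<CdcRow> out = new ArrayList<>();
        for (Map.Entry<String, String> e : incoming.entrySet()) {
            String key = e.getKey();
            String after = e.getValue();
            String before = base.get(key);
            if (after == null) {
                // Delete marker: emit DELETE only if the record existed before.
                if (before != null) out.add(new CdcRow(Op.DELETE, key, before, null));
            } else if (before == null) {
                // Key absent from the snapshot: this is an insert, no before image.
                out.add(new CdcRow(Op.INSERT, key, null, after));
            } else if (!before.equals(after)) {
                // Key present in both with a changed value: update with before image.
                out.add(new CdcRow(Op.UPDATE, key, before, after));
            }
            // Records whose value is unchanged produce no CDC row.
        }
        return out;
    }
}
```

The point of the sketch is the cost argument: producing the before-image for an UPDATE requires reading the previous snapshot at query time, which is exactly the work a persisted CDC file avoids — at the price of the double-write and TTL concerns listed above.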
