danny0405 commented on PR #5436: URL: https://github.com/apache/hudi/pull/5436#issuecomment-1110950019
> > Left some initial comments. I think the main decision here is whether to reuse the existing record-level commit metadata and build CDC on top of it, or to use a separate `.cdc` folder. Can you clarify what exactly is contained in the files under `.cdc`?
>
> Sorry for leaving some points unclear in this RFC doc. Let me address them here, and I'll update the RFC later.
>
> 1. For COW tables, query efficiency is the main focus. I definitely do not want to write out log files; if I have to persist the CDC data, I prefer to double-write, while trying to reuse the normal data files and reduce the extra workload. To answer the question above: the `.cdc` folder keeps the files that we have to write out.
> 2. For MOR tables, we care about write efficiency. In my design I try to avoid writing any extra data or files. But in some cases, for example when `HoodieMergeHandle` executes the write, a base file is produced rather than a log file, so I have to write a CDC file to record the change. And when querying CDC for MOR, we need to merge the incremental data written in log files and base files to determine which records were deleted, which were updated (for those, we also need to find the previous values), and which were inserted.

In general, the first-priority design guideline is not to double-write, for these reasons:

1. The detailed CDC records occupy several times the storage of the actual base data files. This is not acceptable in production; especially for a lake format, we already have active timeline commits for history snapshots.
2. Double-writing would obviously reduce write throughput.
3. If we double-write the log files, we need to handle the transaction spanning the data files and the CDC logs. For example, what if the log write succeeds but the data files fail? Should we fail over, and how? Recover from the log files? There are many corner cases to handle, just as we already did for the metadata table.
4. What about the TTL of the log files? Should it be managed separately from the data files? Say we keep the 10 latest commits for data files, should we also keep that many for log files? How do we clean them, and which component cleans them? The existing cleaning service? Note that the CDC log data set is huge, so the cleaning must be efficient enough.
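The MOR read path described in point 2 above — merging incremental data against the previous snapshot to classify records as inserted, updated (with previous values), or deleted — can be sketched roughly as follows. This is a minimal illustration, not Hudi's actual API: the class, record type, and the flat `Map<String, String>` stand-ins for a base-file snapshot and the incoming commit's records are all hypothetical, with a `null` value used as a delete marker.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not Hudi's API): infer CDC rows for one file group by
// diffing the previous base-file snapshot against the records of a new commit.
public class CdcMergeSketch {

    enum Op { INSERT, UPDATE, DELETE }

    // A minimal CDC row: operation, record key, before image, after image.
    record CdcRow(Op op, String key, String before, String after) {}

    // base: record key -> value as of the previous snapshot.
    // incoming: record key -> new value; a null value marks a delete.
    static List<CdcRow> inferCdc(Map<String, String> base, Map<String, String> incoming) {
        List<CdcRow> out = new ArrayList<>();
        for (Map.Entry<String, String> e : incoming.entrySet()) {
            String key = e.getKey();
            String after = e.getValue();
            String before = base.get(key);
            if (after == null) {
                // Delete marker: emit DELETE only if the record existed before.
                if (before != null) out.add(new CdcRow(Op.DELETE, key, before, null));
            } else if (before == null) {
                // Key absent from the snapshot: this is an insert, no before image.
                out.add(new CdcRow(Op.INSERT, key, null, after));
            } else if (!before.equals(after)) {
                // Key present in both with a changed value: update with before image.
                out.add(new CdcRow(Op.UPDATE, key, before, after));
            }
            // Records whose value is unchanged produce no CDC row.
        }
        return out;
    }
}
```

The point of the sketch is the cost argument: producing the before-image for an UPDATE requires reading the previous snapshot at query time, which is exactly the work a persisted CDC file avoids — at the price of the double-write and TTL concerns listed above.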
