YannByron commented on PR #5436: URL: https://github.com/apache/hudi/pull/5436#issuecomment-1111127124
> In general, the design guideline to consider at first priority is not to double write, for these reasons:
>
> 1. The CDC detail records occupy several times the storage of the actual base data files. This is not acceptable for production, especially for a lake format; we already have active timeline commits for history snapshots.
> 2. The double write would obviously reduce the write throughput.
> 3. If we double write the log files, we need to handle the transaction between the data file completeness and the CDC logs. For example, if we write the log successfully but the data files fail, should we fail over, and how do we do the failover? Recover from the log files? There are many corner cases to handle, just as we already did for the metadata table.
> 4. What about the TTL of the log files: should it be managed separately from the data files? Say we keep the 10 latest commits for data files; should we also keep that many for log files? How do we clean them, and which component cleans them? The existing cleaning service? Note that the log data set is huge, so the cleaning must be efficient enough.

1. Now, we have two table types: COW and MOR. As a lake format, we need to weigh different concerns for the different table types, which are used in different scenarios. As I said above, we should focus more on query performance for COW tables and on write performance for MOR tables. Your solution in the Google doc does the same thing for both. If I understand your solution correctly, it needs a full join to detect the changes for COW. It is implemented with two time-travel queries, i.e., we need to load both versions of the file group even if just one record changed in the COW table (in most streaming cases, perhaps only a tiny fraction changes in one commit).
2. The write throughput is the main point for MOR. In most cases, we do not need to write out extra CDC files.
   The point at which the CDC files have to be generated is when the MOR table writes out a base file, not a log file. After all, in the normal case the MOR table also needs to rewrite base files; it does not always write to log files.
3. Hudi transactions are managed by the timeline. If writing the CDC files or the data files fails, the commit should not complete.
4. The management of log files stays as usual. Only the CDC files need to be cleaned in time by the clean service.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
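To make the cost argument about COW concrete, here is a minimal sketch of inferring CDC events by diffing two snapshots of a file group on the record key, as a full-join-style comparison would. Everything here is illustrative (plain dicts standing in for the two time-travel reads); these are not Hudi APIs.

```python
# Hypothetical sketch: inferring CDC records for a COW table by comparing two
# snapshot reads (the "two time-travel queries" approach). The key observation
# is that BOTH full versions of the file group must be loaded and scanned,
# even when only a single record actually changed.

def infer_cdc(before, after):
    """Diff two snapshots (dicts of record_key -> row) and emit CDC events
    as (op, key, old_row, new_row) tuples."""
    events = []
    # A full outer join on the record key: visit every key present in
    # either version, so the whole of both snapshots is traversed.
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old is None:
            events.append(("insert", key, None, new))
        elif new is None:
            events.append(("delete", key, old, None))
        elif old != new:
            events.append(("update", key, old, new))
    return events

v1 = {"k1": {"amt": 10}, "k2": {"amt": 20}, "k3": {"amt": 30}}
v2 = {"k1": {"amt": 10}, "k2": {"amt": 25}, "k4": {"amt": 40}}
# One update, one delete, one insert -- but all of v1 and v2 were scanned
# to find them, which is the overhead the comment above objects to.
print(sorted(infer_cdc(v1, v2)))
```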
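For the retention question in point 4, a "keep the N latest commits" policy for CDC files can be sketched as below. The file layout and function name are hypothetical, not Hudi's actual clean service.

```python
# Hypothetical sketch of retention-based cleaning for CDC files, mirroring the
# "keep the 10 latest commits" policy discussed above.

def plan_cdc_clean(cdc_files_by_commit, num_commits_retained=10):
    """Given {commit_time: [cdc file paths]}, return the files to delete so
    that only the newest `num_commits_retained` commits keep CDC files."""
    # Hudi-style commit times (yyyyMMddHHmmss) sort lexicographically.
    commits = sorted(cdc_files_by_commit)
    expired = commits[:-num_commits_retained] if num_commits_retained else commits
    return [f for c in expired for f in cdc_files_by_commit[c]]

files = {f"2022042{i}": [f"cdc_{i}.log"] for i in range(5)}  # 5 commits
print(plan_cdc_clean(files, num_commits_retained=3))  # drops the 2 oldest
```

Since the CDC log set can be large, a real implementation would fold this into the existing cleaning service rather than scan for expired files separately.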
