YannByron commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1111670084

   @vinothchandar
   
   To share how I think about the CDC scenario:
   The CDC scenario should be **a whole pipeline**, perhaps from ODW to DWD, 
to DWS, and on to other downstream layers. It's not just **a simple, single 
read/write job**. So we can't focus only on upstream write efficiency and 
ignore downstream query efficiency. We need to balance the two and consider 
the problem from the perspective of the whole incremental warehouse.
   
   Back to this solution: in the cases where we have to write out CDC files, 
whether the table is MOR or COW, I think that in most streaming scenarios only 
a fraction of the data needs to be inserted/updated/deleted. Let me make it a 
little clearer with numbers. Suppose a file has 100 records and 5 of them 
change. We then rewrite a base file in which 95 records are kept and 5 are 
changed, but only those 5 records are written out to the CDC files. I think 
this is acceptable.
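
   As a rough sketch of the arithmetic above (the function name is hypothetical 
and not part of Hudi; it assumes every record costs about the same to write and 
that each changed record produces exactly one CDC record):

```python
def cdc_write_overhead(total_records: int, changed_records: int) -> float:
    """Return CDC records written as a fraction of the rewritten base file.

    Assumption: one CDC record per changed record, and all records are
    roughly the same size. Real CDC records may be larger if they carry
    both before and after images.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    if not 0 <= changed_records <= total_records:
        raise ValueError("changed_records must be between 0 and total_records")
    return changed_records / total_records


# Numbers from the example: a 100-record file with 5 changed records.
total = 100
changed = 5
kept = total - changed  # records copied unchanged into the new base file
overhead = cdc_write_overhead(total, changed)
print(f"base file rewrite: {total} records ({kept} kept, {changed} changed)")
print(f"CDC file: {changed} records, about {overhead:.0%} extra")
```

   Under these assumptions the extra write cost grows with the change ratio, so 
it stays small as long as only a small share of each file changes per commit.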

