vinothchandar commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1116735024

   @danny0405 @YannByron 
   
   I see the major sticking point is - 
   
   Option A) a separate `.cdc` folder that contains the CDC log (similar to redo 
logs in databases)
   Option B) using the `_hoodie_operation` flag, so the CDC log is effectively 
stored inline with the data.
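
   To make Option B concrete, here is a minimal sketch (plain Python, not Hudi code; the field names mirror Hudi's meta columns but the reader logic is a simplification): each record written to the data files carries a `_hoodie_operation` flag, so a CDC reader can emit change events directly from the files committed in the requested time range, with no separate `.cdc` log to consult.

   ```python
   # Illustrative sketch of Option B: the CDC "log" is just the data files,
   # with each record tagged by commit time and operation (I/U/D).
   records = [
       {"_hoodie_commit_time": "001", "_hoodie_operation": "I", "id": 1, "val": "a"},
       {"_hoodie_commit_time": "002", "_hoodie_operation": "U", "id": 1, "val": "b"},
       {"_hoodie_commit_time": "002", "_hoodie_operation": "D", "id": 2, "val": None},
   ]

   def cdc_events(records, begin_time):
       """Emit change events for commits after begin_time, straight from data files."""
       return [r for r in records if r["_hoodie_commit_time"] > begin_time]

   print(cdc_events(records, "001"))  # the U and D events from commit 002
   ```

   Note that partially written data is skipped for free here: uncommitted files are simply not visible to the reader, which is the point about not having to re-implement that skipping for a separate `.cdc` path.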
   
   A few considerations that I think did not come across all that well in the 
discussion above. 
   
   @YannByron 
   
   - Danny's concerns around double writing to the `.cdc` log and then the actual 
data file center not around transactionality, but around all the work we now need 
to do to implement skipping of such partially written data in the CDC read path. 
This comes for free if we leverage the data files (base/log) as the CDC log 
itself.  
   - Similarly for the management of `.cdc`, i.e. cleaning, clustering, small 
files: every problem we solve for data files, we would need to solve again for 
the CDC log.
   - Writing every byte twice (even if the data changing every day is a 
fraction of the total table size) could cost existing large users millions of 
dollars more! (I am not exaggerating here :), I have seen what some Hudi users' 
scales look like.) I think this is Danny's main concern as well.
   
   @danny0405 
   - Yann's concerns are around the cost of "joining" different file slices 
together to generate the CDC stream, which is a valid concern as well. There is 
more "compute" cost paid per CDC query in this approach. 
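
   To illustrate that join cost (again plain Python, not Hudi code; the slices are modeled as key-to-row maps for simplicity): deriving before/after images under Option B means joining the previous file slice with the new one on record key, which is the per-query compute Yann is pointing at.

   ```python
   # Illustrative sketch: reconstructing CDC events by diffing two file slices
   # of the same file group, joined on the record key.
   old_slice = {1: {"id": 1, "val": "a"}, 2: {"id": 2, "val": "x"}}
   new_slice = {1: {"id": 1, "val": "b"}, 3: {"id": 3, "val": "y"}}

   def diff_slices(old, new):
       """Return (op, before_image, after_image) tuples for the change stream."""
       events = []
       for key, row in new.items():
           if key not in old:
               events.append(("insert", None, row))
           elif old[key] != row:
               events.append(("update", old[key], row))
       for key, row in old.items():
           if key not in new:
               events.append(("delete", row, None))
       return events

   print(diff_slices(old_slice, new_slice))
   ```

   In a real table this join runs over columnar files at query time, so its cost scales with the size of the slices touched, not with the size of the change, which is why benchmarking it upfront is fair.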
   
   If you press me, I am still leaning toward Option B, doing it inline, i.e. 
trading potentially (I'll explain why I say this) better CDC read efficiency 
for simplicity of implementation plus reduced storage cost. 
   
   - Option A works for databases, but notice that most warehouses did not 
support a change-log kind of mechanism, due to storage concerns. Lakes store 
far more data than even warehouses do. 
   - The 10x efficiency gain here is going to come from moving from batch 
queries to CDC/incremental queries, and within those, the added joining of file 
slices for CDC may not be as bad as we think. We should benchmark some of the 
join costs for Option B; that's a fair concern to address upfront. 
   - Honestly, Option B is much simpler to implement on top of Hudi. We already 
have most of the pieces there. 
   
   True to my point earlier about treating databases as the north star here: 
there is a common technique called "supplemental logging", where the database 
proactively adds extra fields (i.e. the before-image of a record) to the redo 
log, to avoid this overhead for CDC reads. 
https://docs.oracle.com/database/121/SUTIL/GUID-D2DDD67C-E1CC-45A6-A2A7-198E4C142FA3.htm#SUTIL1583
 . We can consider implementing something like this for MOR tables (which have 
an extensible data block format) to reduce this joining overhead for CDC reads. 
COW tables may not be able to do this per se (or maybe by introducing a new 
`_hoodie_before` field that contains the entire previous row image). 
@YannByron I am just saying that what you raise can be solved in Option B as 
well. 
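
   A minimal sketch of that supplemental-logging idea applied to Option B (plain Python, not Hudi code; `_hoodie_before` is the hypothetical field floated above, not an existing Hudi meta column): on update, the writer also stores the previous row image next to the new row, so the CDC reader needs no slice join at all.

   ```python
   # Illustrative sketch: the writer captures the before-image at write time
   # and stores it in a hypothetical _hoodie_before field, trading extra
   # stored bytes for join-free CDC reads.
   def write_update(current, key, new_row):
       before = current.get(key)          # before-image, known to the writer anyway
       record = dict(new_row)
       record["_hoodie_before"] = before  # supplemental logging, inline with data
       current[key] = new_row
       return record

   table = {1: {"id": 1, "val": "a"}}
   rec = write_update(table, 1, {"id": 1, "val": "b"})
   print(rec["_hoodie_before"])  # {'id': 1, 'val': 'a'}
   ```

   The writer already has the before-image in hand when merging, which is why this costs little extra compute on the write path; the overhead is the extra stored column, not a second full copy of every byte as in Option A.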
   
   This is an awesome conversation, folks! Glad we have such amazing talent in 
the community! Let's work together and finalize this!
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
