vinothchandar commented on PR #5436: URL: https://github.com/apache/hudi/pull/5436#issuecomment-1116735024
@danny0405 @YannByron I see the major sticking point is:

- Option A) a separate `.cdc` folder that contains the CDC log (similar to redo logs in databases)
- Option B) using the `_hoodie_operation` flag, so the CDC log is effectively stored inline with the data

A few considerations that I think did not come across that well in the discussion above.

@YannByron
- Danny's concerns around double writing to the `.cdc` log and then the actual data file center not on transactionality, but on all the work we would now need to do to skip such partially written data in the CDC read path. This comes for free if we leverage the data files (base/log) as the CDC log itself.
- Similarly for the management of the `.cdc` log - cleaning, clustering, small files - every problem we solve for data files, we would need to solve again for the CDC log.
- Writing every byte twice (even if the data changing every day is a fraction of the total table size) could cost existing large users millions of dollars more! (I am not exaggerating here :), I have seen what some Hudi users' scales look like.) This is Danny's main concern as well, I think.

@danny0405
- Yann's concern is the cost of "joining" different file slices together to generate the CDC stream, which is a valid concern as well. There is more compute cost paid per CDC query in this approach.

If you press me, I am still leaning toward Option B, doing it inline (i.e. trading off the simplicity of implementation + reduced storage cost) over potentially (I'll explain why I say this) better CDC read efficiency.

- Option A works for databases, but notice that most warehouses did not support a change-log mechanism of this kind, due to storage concerns. Lakes store far more data than even warehouses do.
- The 10x efficiency gain here is going to come from moving from batch queries to CDC/incremental queries, and within these, the added joining of file slices for CDC may not be as bad as we think.
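To make the "joining file slices" cost concrete, here is a minimal sketch (plain Python, not Hudi code - the record-key/dict model and field names are simplified assumptions) of what an Option B CDC reader conceptually does: compare two consecutive snapshots of a file group and derive change events with before/after images:

```python
# Illustrative sketch, NOT Hudi's implementation: derive a CDC stream
# under Option B by comparing two consecutive file slices of a file
# group. Each slice is modeled as {record_key: row_dict}.

def diff_file_slices(prev_slice, curr_slice):
    """Full-outer-join two snapshots on record key and emit CDC
    events with before/after images, like a database redo log."""
    events = []
    for key, after in curr_slice.items():
        before = prev_slice.get(key)
        if before is None:
            events.append({"op": "insert", "key": key,
                           "before": None, "after": after})
        elif before != after:
            events.append({"op": "update", "key": key,
                           "before": before, "after": after})
    for key, before in prev_slice.items():
        if key not in curr_slice:
            events.append({"op": "delete", "key": key,
                           "before": before, "after": None})
    return events

prev = {"k1": {"v": 1}, "k2": {"v": 2}}
curr = {"k1": {"v": 10}, "k3": {"v": 3}}
events = diff_file_slices(prev, curr)
# one update (k1), one insert (k3), one delete (k2)
```

The per-query cost Yann is pointing at is exactly this join: both the old and new slice have to be read to reconstruct before-images.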
We should benchmark some of the join costs for Option B; that's a fair concern to address upfront.

- Honestly, Option B is much simpler to implement on top of Hudi. We already have most of the pieces there.

True to my earlier point about treating databases as the north star here: there is a common technique called "supplemental logging", where the database proactively adds extra fields (i.e. the before-image of a record) to the redo log, to avoid this overhead for CDC reads. See https://docs.oracle.com/database/121/SUTIL/GUID-D2DDD67C-E1CC-45A6-A2A7-198E4C142FA3.htm#SUTIL1583. We could consider implementing something like this for MOR tables (which have an extensible data block format) to reduce the overhead of joining for CDC reads. COW tables may not be able to do this per se (or maybe they could, by introducing a new `_hoodie_before` field that contains the entire previous row image).

@YannByron I am just saying that what you raise can be solved in Option B as well.

This is an awesome conversation, folks! Glad we have such amazing talent in the community! Let's work together and finalize this!
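The supplemental-logging idea above can be sketched as follows (again plain Python, not Hudi code; the `_hoodie_before` field is the proposed/hypothetical name from the discussion, and the writer-side lookup is simplified): the writer embeds each record's before-image inline, so the CDC reader becomes a single scan with no file-slice join:

```python
# Illustrative sketch, NOT Hudi's implementation: "supplemental logging"
# where the writer stores the before-image inline with each record
# (cf. the proposed `_hoodie_before` field), trading extra storage for
# a join-free CDC read path.

def write_with_before_image(existing, upserts):
    """Writer side: attach the before-image to each upserted record.
    The lookup against `existing` is work an upsert does anyway."""
    return [{"key": key,
             "after": after,
             "_hoodie_before": existing.get(key)}
            for key, after in upserts.items()]

def cdc_from_supplemental_log(written):
    """Reader side: one pass over the new data, no join with the
    previous file slice, since before-images are already inline."""
    return [{"op": "insert" if r["_hoodie_before"] is None else "update",
             "key": r["key"],
             "before": r["_hoodie_before"],
             "after": r["after"]}
            for r in written]

existing = {"k1": {"v": 1}}
upserts = {"k1": {"v": 2}, "k2": {"v": 5}}
events = cdc_from_supplemental_log(write_with_before_image(existing, upserts))
# one update (k1), one insert (k2), derived without touching the old slice
```

The trade-off is visible in the sketch: storage grows (every written record carries its before-image) but the CDC query never re-reads the prior file slice, which is the "more compute per CDC query" cost this would eliminate.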
