vinothchandar commented on PR #5436: URL: https://github.com/apache/hudi/pull/5436#issuecomment-1116735024
@danny0405 @YannByron I see the major sticking point is:

- Option A) a separate `.cdc` folder that contains the CDC log (similar to redo logs in databases)
- Option B) using the `_hoodie_operation` flag, so the CDC log is effectively stored inline with the data

A few considerations that I think did not come across that well in the discussion above.

@YannByron
- Danny's concerns around double writing to the `.cdc` log and then the actual data file center not on transactionality, but on all the work we would now need to do to skip such partially written data in the CDC read path. This comes for free if we leverage the data files (base/log) as the CDC log itself.
- Similarly for the management of the `.cdc` log - cleaning, clustering, small files - every problem we solve for data files, we would need to solve again for the CDC log.
- Writing every byte twice (even if the data changing every day is a fraction of the total table size) could cost existing large users millions of dollars more! (I am not exaggerating here :), I have seen what some Hudi users' scales look like.) This is Danny's main concern as well, I think.

@danny0405
- Yann's concern is the cost of "joining" different file slices together to generate the CDC stream, which is a valid concern as well. There is more compute cost paid per CDC query in this approach.

If you press me, I am still leaning toward Option B, doing it inline (i.e. trading off the simplicity of implementation + reduced storage cost) over potentially (I'll explain why I say this) better CDC read efficiency.

- Option A works for databases, but notice that most warehouses did not support a change-log mechanism of this kind, due to storage concerns. Lakes store far more data than even warehouses do.
- The 10x efficiency gain here is going to come from moving from batch queries to CDC/incremental queries, and within these, the added joining of file slices for CDC may not be as bad as we think.
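To make the "joining file slices" cost concrete, here is a minimal sketch (plain Python, not Hudi code - the record-key/dict model and field names are simplified assumptions) of what an Option B CDC reader conceptually does: compare two consecutive snapshots of a file group and derive change events with before/after images:

```python
# Illustrative sketch, NOT Hudi's implementation: derive a CDC stream
# under Option B by comparing two consecutive file slices of a file
# group. Each slice is modeled as {record_key: row_dict}.

def diff_file_slices(prev_slice, curr_slice):
    """Full-outer-join two snapshots on record key and emit CDC
    events with before/after images, like a database redo log."""
    events = []
    for key, after in curr_slice.items():
        before = prev_slice.get(key)
        if before is None:
            events.append({"op": "insert", "key": key,
                           "before": None, "after": after})
        elif before != after:
            events.append({"op": "update", "key": key,
                           "before": before, "after": after})
    for key, before in prev_slice.items():
        if key not in curr_slice:
            events.append({"op": "delete", "key": key,
                           "before": before, "after": None})
    return events

prev = {"k1": {"v": 1}, "k2": {"v": 2}}
curr = {"k1": {"v": 10}, "k3": {"v": 3}}
events = diff_file_slices(prev, curr)
# one update (k1), one insert (k3), one delete (k2)
```

The per-query cost Yann is pointing at is exactly this join: both the old and new slice have to be read to reconstruct before-images.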
We should benchmark some of the join costs for Option B; that's a fair concern to address upfront.

- Honestly, Option B is much simpler to implement on top of Hudi. We already have most of the pieces there.

True to my earlier point about treating databases as the north star here: there is a common technique called "supplemental logging", where the database proactively adds extra fields (i.e. the before-image of a record) to the redo log, to avoid this overhead for CDC reads. See https://docs.oracle.com/database/121/SUTIL/GUID-D2DDD67C-E1CC-45A6-A2A7-198E4C142FA3.htm#SUTIL1583. We could consider implementing something like this for MOR tables (which have an extensible data block format) to reduce the overhead of joining for CDC reads. COW tables may not be able to do this per se (or maybe they could, by introducing a new `_hoodie_before` field that contains the entire previous row image).

@YannByron I am just saying that what you raise can be solved in Option B as well.

This is an awesome conversation, folks! Glad we have such amazing talent in the community! Let's work together and finalize this!
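The supplemental-logging idea above can be sketched as follows (again plain Python, not Hudi code; the `_hoodie_before` field is the proposed/hypothetical name from the discussion, and the writer-side lookup is simplified): the writer embeds each record's before-image inline, so the CDC reader becomes a single scan with no file-slice join:

```python
# Illustrative sketch, NOT Hudi's implementation: "supplemental logging"
# where the writer stores the before-image inline with each record
# (cf. the proposed `_hoodie_before` field), trading extra storage for
# a join-free CDC read path.

def write_with_before_image(existing, upserts):
    """Writer side: attach the before-image to each upserted record.
    The lookup against `existing` is work an upsert does anyway."""
    return [{"key": key,
             "after": after,
             "_hoodie_before": existing.get(key)}
            for key, after in upserts.items()]

def cdc_from_supplemental_log(written):
    """Reader side: one pass over the new data, no join with the
    previous file slice, since before-images are already inline."""
    return [{"op": "insert" if r["_hoodie_before"] is None else "update",
             "key": r["key"],
             "before": r["_hoodie_before"],
             "after": r["after"]}
            for r in written]

existing = {"k1": {"v": 1}}
upserts = {"k1": {"v": 2}, "k2": {"v": 5}}
events = cdc_from_supplemental_log(write_with_before_image(existing, upserts))
# one update (k1), one insert (k2), derived without touching the old slice
```

The trade-off is visible in the sketch: storage grows (every written record carries its before-image) but the CDC query never re-reads the prior file slice, which is the "more compute per CDC query" cost this would eliminate.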
