[GitHub] [hudi] vinothchandar commented on pull request #5436: [RFC-51] [HUDI-3478] Change Data Capture RFC

GitBox Thu, 09 Jun 2022 07:44:58 -0700


vinothchandar commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1151213942

> because computing the diff on the fly is not a heavy work like you
previously mentioned.

I think this is the crux of the earlier disagreement and the key concern from
@YannByron. Let's take a COW example where the records changed in the latest
file slice are spread evenly across row groups/pages (red lines on the side) in
the previous file slice. To compute the diff, we potentially need to read the
entire previous file slice.

If we have a CDC log attached to the slice at t1, then we only read the
changed rows. So there is a difference IMO in terms of total query I/O, i.e.
bytes read per CDC query. This has been our single biggest source of confusion.
We say we will make it flexible between on-the-fly and materializing, and then
go right back to claiming that materializing is not needed :) and/or that
on-the-fly is cheap. Should we actually microbenchmark both approaches and be
done?
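
As a rough illustration of the I/O argument above, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration (the `FileSlice` model, the row counts, the row sizes, and the idea that a CDC log stores a before and an after image per change); none of it is a Hudi API or a measurement. It only counts bytes read from the previous file slice, assuming a row group can be skipped only when it contains no changed row.

```python
from dataclasses import dataclass


@dataclass
class FileSlice:
    """Toy model of a base file with equal-sized row groups (hypothetical)."""
    num_rows: int
    num_row_groups: int
    bytes_per_row: int

    @property
    def rows_per_group(self) -> int:
        return self.num_rows // self.num_row_groups


def evenly_spread_positions(num_rows: int, num_changed: int) -> list[int]:
    """Row positions of the changed records, spread evenly across the file."""
    step = num_rows / num_changed
    return [int(i * step) for i in range(num_changed)]


def on_the_fly_bytes(prev: FileSlice, changed_positions: list[int]) -> int:
    """Bytes read from the previous slice when computing the diff on the fly.

    Assumption: a row group is skipped only if it holds no changed row.
    """
    touched = {pos // prev.rows_per_group for pos in changed_positions}
    return len(touched) * prev.rows_per_group * prev.bytes_per_row


def cdc_log_bytes(num_changed: int, bytes_per_row: int) -> int:
    """Bytes read from a materialized CDC log.

    Assumption: one before image and one after image per changed record.
    """
    return num_changed * 2 * bytes_per_row


# Made-up numbers: 1M rows, 100 row groups, 0.1% of rows changed.
prev_slice = FileSlice(num_rows=1_000_000, num_row_groups=100, bytes_per_row=200)
changed = evenly_spread_positions(prev_slice.num_rows, num_changed=1_000)

diff_bytes = on_the_fly_bytes(prev_slice, changed)
log_bytes = cdc_log_bytes(len(changed), prev_slice.bytes_per_row)
print(f"on-the-fly: {diff_bytes} bytes, CDC log: {log_bytes} bytes")
```

With these made-up numbers, every row group holds at least one changed row, so the on-the-fly path reads the whole previous slice while the log path reads only the changed rows. A real microbenchmark would still be needed to account for column pruning, compression, and the cost of writing the log.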

>the MOR table has different layout for different index,

Yes, that is my meta point: our design should not be based on handles, but on a
more generic CDC log format. In the case where inserts are sent to the log (even
for the Spark/HBase index, they are IIRC), we would then encode `I` in the CDC
format correctly.
This is what I mean by "we need to design this even more generically at the
file group/slice level".
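
To make "generic at the file group/slice level" a bit more concrete, here is a minimal Python sketch. It is purely hypothetical: the `CdcRecord` shape, the `U` and `D` op codes, and the `encode_changes` helper are assumptions for illustration, not the format proposed in the RFC (only `I` is mentioned above). The point it tries to show is that the operation is decided from the record's presence in the previous and current views of a file group, not from which write handle produced it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Op(Enum):
    INSERT = "I"   # mentioned in the comment above
    UPDATE = "U"   # assumed for illustration
    DELETE = "D"   # assumed for illustration


@dataclass(frozen=True)
class CdcRecord:
    """Hypothetical CDC log entry scoped to one file group."""
    op: Op
    key: str
    before: Optional[dict]
    after: Optional[dict]


def encode_changes(prev: dict, curr: dict) -> list[CdcRecord]:
    """Derive CDC entries from two keyed views of the same file group.

    The op depends only on whether the key exists before and after,
    so an insert that landed in a log file still becomes an `I`.
    """
    records = []
    for key in sorted(prev.keys() | curr.keys()):
        before, after = prev.get(key), curr.get(key)
        if before is None:
            records.append(CdcRecord(Op.INSERT, key, None, after))
        elif after is None:
            records.append(CdcRecord(Op.DELETE, key, before, None))
        elif before != after:
            records.append(CdcRecord(Op.UPDATE, key, before, after))
    return records


prev_view = {"a": {"v": 1}, "b": {"v": 2}, "d": {"v": 9}}
curr_view = {"a": {"v": 1}, "b": {"v": 3}, "c": {"v": 4}}
for rec in encode_changes(prev_view, curr_view):
    print(rec.op.value, rec.key, rec.before, rec.after)
```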

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

