[GitHub] [hudi] vinothchandar commented on pull request #5436: [RFC-51] [HUDI-3478] Change Data Capture RFC

GitBox Wed, 01 Jun 2022 00:51:28 -0700


vinothchandar commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1143238175

@danny0405 @YannByron I went over our conversations, re-read this RFC and
the google doc. The main outstanding debate is whether to materialize the
`before` image of the records during write time (batch friendly) or derive it
on the fly during query time (streaming friendly). Materializing costs more $$
in terms of storage cost + the additional compute cost for double writing.

Conceptually, deriving it on the fly (using just a `_hoodie_operation` field
added to each record) needs more work on CDC query side. E.g: let's take a CDC
query for changes between commit times [t1, tN]. For each file group, we need
to start with the latest file slice as of t1, extract rows that changed after
or at t1, and extract its before image from the file slice before. Effectively,
this will incur double reading. But even with materialization, we cannot really
avoid this, can we?

![image](https://user-images.githubusercontent.com/1179324/171354552-e0526753-9c45-41f7-ba46-808af864aee5.png)

@YannByron Can you quantify the extra cost in stitching/diffing on the fly?
Is that performance sensitive? Trying to understand why you think if we don't
materialize the CDC query cost will be high.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] vinothchandar commented on pull request #5436: [RFC-51] [HUDI-3478] Change Data Capture RFC

Reply via email to