Re: Change Data Capture for Iceberg

Yufei Gu Wed, 27 Apr 2022 09:39:05 -0700

I have to move the meeting to next Monday (May 2) due to a conflict. Sorry
about that.


Change Data Capture for Iceberg
Monday, May 2 · 9:00 – 10:00am
Google Meet joining info
Video call link: https://meet.google.com/pjv-cspg-xos

Best,

Yufei

`This is not a contribution`


On Tue, Apr 26, 2022 at 12:18 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Hi everyone,
>
> Here is the Change Data Capture update. I posted a draft PR
> (https://github.com/apache/iceberg/pull/4539) 2 weeks ago and got lots of
> reviews. Thank you all for the review. Based on the feedback, we will move
> forward with the approach and open separate formal PRs. We are also
> planning to have a meeting to share the general idea of the approach and
> the next steps. Looking forward to seeing you there. Here is the meeting info.
>
> Change Data Capture for Iceberg
> Friday, April 29 · 9:00 – 10:00am
> Google Meet joining info
> Video call link: https://meet.google.com/pjv-cspg-xos
>
> Best,
>
> Yufei
>
> `This is not a contribution`
>
>
> On Tue, Mar 29, 2022 at 4:33 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Synced up with Anton and Russell on the CDC design and implementation.
>> Here are the changes for getting deleted rows in the MVP.
>>
>> We will leverage the `_deleted` metadata column for both pos deletes and
>> eq deletes. This eliminates limitations of the original design. In particular,
>> instead of emitting equality deletes directly as CDC deleted rows, we
>> resolve the eq deletes to the actual deleted rows and emit those as CDC
>> delete rows. For example, an eq delete may delete two data rows. We will
>> emit the 2 actual deleted rows.
>>
>> We change the design so that we emit all deleted (pos and eq) rows
>> together in the same format. This is simpler and more efficient than the
>> original design.
>> 1. We don't have to output identifier fields.
>> 2. Downstream tables can write CDC deleted rows directly as eq deletes
>> without using "merge".
>> 3. It is easier to reconstruct the update in phase 2.
>>
>> The downside is that it is expensive for certain use cases. For example,
>> it has to scan all data files to resolve global eq deletes. We can try to
>> solve this by providing an option to emit eq delete rows directly in the
>> future. Please refer to
>> https://github.com/apache/iceberg/issues/3941#issuecomment-1081273709
>> for more details.
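
A minimal Python sketch of the resolution step described above, under
simplifying assumptions: rows are plain dicts, an equality delete is a dict
of column values to match, and sequence numbers and file layout are ignored.
The name `resolve_eq_deletes` and the sample data are hypothetical and are
not part of the Iceberg API.

```python
from typing import Dict, List

Row = Dict[str, object]


def resolve_eq_deletes(data_rows: List[Row], eq_deletes: List[Row]) -> List[Row]:
    """Return the actual data rows matched by at least one equality delete."""
    deleted = []
    for row in data_rows:
        for predicate in eq_deletes:
            # An equality delete matches when every one of its columns
            # has the same value in the data row.
            if all(row.get(col) == val for col, val in predicate.items()):
                deleted.append(row)
                break  # emit each deleted row once
    return deleted


# Hypothetical table contents: one eq delete on id = 1 removes two rows.
data_rows = [
    {"id": 1, "name": "Amy", "age": 20},
    {"id": 1, "name": "Amy", "age": 21},
    {"id": 2, "name": "Bob", "age": 30},
]
eq_deletes = [{"id": 1}]

# The CDC output carries the two full deleted rows, not the predicate.
cdc_deletes = resolve_eq_deletes(data_rows, eq_deletes)
```

Every data row has to be checked against the delete, which illustrates the
scan cost mentioned above for global eq deletes.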
>>
>> Let us know if you have any feedback. Thanks.
>>
>> Yufei
>>
>>
>> On Wed, Mar 9, 2022 at 9:59 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>>
>>> Thanks for joining the discussion in the sync-up last Friday. We’ve
>>> reached consensus on several items:
>>>
>>>    1. The snapshot granularity CDC generation is useful, and will cover a
>>>       wide range of use cases. Sub-snapshot granularity is out of scope at
>>>       this moment and needs a separate proposal.
>>>    2. For COW, we should treat all rows from the deleted data files as the
>>>       deleted rows, which is more efficient and, more importantly, doesn’t
>>>       yield wrong results when duplicate rows exist.
>>>    3. Create a minimum viable product (MVP) according to the current
>>>       design.
>>>
>>> Thanks Anton for the comments in
>>> https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554.
>>>
>>>
>>> Based on the meetup and Anton's comment, here is the plan to move forward.
>>> We will split the implementation into two phases. The minimum viable
>>> product (MVP) in phase 1 will have most things from the proposal with the
>>> following adjustments.
>>>
>>>
>>> *Phase 1 (MVP)*
>>>
>>>    1. Emit delete and insert CDC records only.
>>>    2. Don’t join for equality deletes. Emit equality deletes directly
>>>       as deleted rows, per Anton’s suggestion. Otherwise, we would need to
>>>       join the whole table with the equality delete files, which is not
>>>       scalable. We will evaluate the cost of the join in phase 2 and
>>>       probably support it, or find another way to approach it.
>>>    3. COW: output all rows in the deleted data files as the deleted
>>>       rows, and all rows in the added data files as the inserted rows. We
>>>       will figure out a more scalable way to filter out unchanged rows in
>>>       phase 2. The approach of joining on all columns has two issues:
>>>       1. It is not scalable; think about a table with more than 100
>>>          columns.
>>>       2. It cannot handle duplicate records, e.g. (1, Amy, 20) was in the
>>>          data files marked as deleted, then we got new data files with
>>>          two identical rows, (1, Amy, 20) and (1, Amy, 20).
>>>    4. User interface: create an action to generate CDC records instead
>>>       of a procedure. An action can return a dataframe, which is more
>>>       convenient than an array of InternalRow produced by a Spark
>>>       procedure.
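
A small Python sketch of the COW item above, using the (1, Amy, 20) example.
Rows are tuples; `cow_changes` and `net_changes_by_full_row_match` are
hypothetical names, and the set difference is only a simplified stand-in for
a join on all columns, not how Iceberg or Spark would implement it.

```python
from typing import List, Tuple

Row = Tuple[int, str, int]
Change = Tuple[str, Row]


def cow_changes(deleted_file_rows: List[Row], added_file_rows: List[Row]) -> List[Change]:
    """MVP behavior: every row of a removed file is a delete,
    every row of an added file is an insert."""
    changes = [("DELETE", r) for r in deleted_file_rows]
    changes += [("INSERT", r) for r in added_file_rows]
    return changes


def net_changes_by_full_row_match(deleted_file_rows: List[Row],
                                  added_file_rows: List[Row]) -> List[Change]:
    """Stand-in for joining on all columns: any row value present on both
    sides is treated as unchanged, regardless of how many copies exist."""
    old, new = set(deleted_file_rows), set(added_file_rows)
    return ([("DELETE", r) for r in old - new]
            + [("INSERT", r) for r in new - old])


# One copy of the row is rewritten, and a second identical copy is added.
deleted = [(1, "Amy", 20)]
added = [(1, "Amy", 20), (1, "Amy", 20)]

mvp = cow_changes(deleted, added)                      # 1 delete, 2 inserts
naive = net_changes_by_full_row_match(deleted, added)  # loses the new copy
```

The whole-file output is noisier, but its net effect (one delete, two
inserts) is correct, while the full-row match reports no change at all.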
>>>
>>> *Phase 2*
>>>
>>>    1. Enable update reconstruction to emit CDC update records.
>>>    2. COW: filter out unchanged rows.
>>>    3. User interface: support the metatable, which will enable more use
>>>       cases, e.g., the streaming use case.
>>>
>>>
>>> Best,
>>>
>>> Yufei
>>>
>>> `This is not a contribution`
>>>
>>>
>>> On Mon, Mar 7, 2022 at 1:30 PM Anton Okolnychyi
>>> <aokolnyc...@apple.com.invalid> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> Based on Yufei’s design doc and what we discussed during the sync, I
>>>> shared my thoughts on what can be efficiently supported right now.
>>>>
>>>> https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554
>>>>
>>>> I’d be interested to learn more about specific use cases that would
>>>> violate the assumptions I listed in my comment. If you have such a use case
>>>> in mind, please, comment on the issue.
>>>>
>>>> - Anton
>>>>
>>>>
>>>> On 24 Feb 2022, at 14:57, Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> We are moving the CDC design discussion to Friday of next week (Mar 4),
>>>> 9-10am PST, due to an unexpected event. The meeting link will be the same,
>>>> meet.google.com/vam-cmfx-feo. Thanks!
>>>>
>>>> Best,
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Tue, Feb 22, 2022 at 12:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> It's great to see a lot of interest in the design.
>>>>> We are planning to have a meeting to discuss the Iceberg CDC design on
>>>>> Friday (2/25), 9-10am PST. The meeting link is
>>>>> meet.google.com/vam-cmfx-feo. We will talk about the general idea, as
>>>>> well as open questions. The meeting will be recorded.
>>>>>
>>>>>
>>>>> Best,
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Fri, Feb 11, 2022 at 3:54 PM Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Oh cool, I have not had a chance to review much of this, but I was
>>>>>> having a conversation with a team which wanted similar features for a
>>>>>> table, so excited to see folks working on it 👍
>>>>>>
>>>>>> On Fri, Feb 11, 2022 at 12:40 PM Yufei Gu <flyrain...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi team,
>>>>>>>
>>>>>>> We propose a way to generate CDC records from Iceberg tables. The
>>>>>>> approach requires no table spec change and no write-time logging.
>>>>>>> It will cover the majority of CDC use cases, though not necessarily
>>>>>>> all of them. We believe it's a good starting point for approaching
>>>>>>> CDC in Iceberg.
>>>>>>> Any feedback is welcome!
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1bN6rdLNcYOHnT3xVBfB33BoiPO06aKBo56SZmuU9pnY/edit?usp=sharing
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>>
