I have to change the meeting to next Monday (May 2) due to a conflict. Sorry about that.

Change Data Capture for Iceberg
Monday, May 2 · 9:00 – 10:00am
Google Meet joining info
Video call link: https://meet.google.com/pjv-cspg-xos

Best,

Yufei

`This is not a contribution`


On Tue, Apr 26, 2022 at 12:18 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Hi everyone,
>
> Here is the Change Data Capture update. I posted a draft PR
> (https://github.com/apache/iceberg/pull/4539) 2 weeks ago, and got lots of
> reviews. Thank you all for the review. Based on the feedback, we will move
> forward with the approach and open separate formal PRs. We are also
> planning to have a meeting to share the general idea of the approach and
> next steps. Looking forward to seeing you there. Here is the meeting info.
>
> Change Data Capture for Iceberg
> Friday, April 29 · 9:00 – 10:00am
> Google Meet joining info
> Video call link: https://meet.google.com/pjv-cspg-xos
>
> Best,
>
> Yufei
>
> `This is not a contribution`
>
>
> On Tue, Mar 29, 2022 at 4:33 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Synced up with Anton and Russell on the CDC design and implementation.
>> Here are the changes to get deleted rows in the MVP.
>>
>> We will leverage the `_deleted` metadata column for both pos deletes and
>> eq deletes. This eliminates limitations of the original design. In
>> particular, instead of emitting equality deletes directly as CDC deleted
>> rows, we resolve the eq deletes to actual deleted rows and emit them as
>> CDC delete rows. For example, an eq delete may delete two data rows. We
>> will emit the 2 actual deleted rows.
>>
>> We change the design so that we emit all deleted (pos and eq) rows
>> together in the same format. This is simpler and more efficient than the
>> original design.
>> 1. We don't have to output identifier fields.
>> 2. Downstream tables can write CDC deleted rows directly as eq deletes
>>    without using "merge".
>> 3. It is easier to reconstruct the update in phase 2.
>>
>> The downside is that it is expensive for certain use cases.
>> For example, it has to scan all data files to resolve global eq deletes.
>> We can try to solve this by providing an option to emit eq delete rows
>> directly in the future. Please refer to
>> https://github.com/apache/iceberg/issues/3941#issuecomment-1081273709
>> for more details.
>>
>> Let us know if you have any feedback. Thanks.
>>
>> Yufei
>>
>>
>> On Wed, Mar 9, 2022 at 9:59 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> Thanks for joining and for the discussion in the sync-up last Friday.
>>> We’ve reached consensus on several items:
>>>
>>> 1. The snapshot granularity CDC generation is useful, and will cover a
>>>    wide range of use cases. Sub-snapshot granularity is out of scope at
>>>    this moment and needs a separate proposal.
>>> 2. For COW, we should treat all rows from the deleted data files as the
>>>    deleted rows, which is more efficient, and more importantly, it
>>>    doesn’t yield wrong results when duplicate rows exist.
>>> 3. Creating a minimum viable product (MVP) according to the current
>>>    design.
>>>
>>> Thanks Anton for the comments in
>>> https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554.
>>>
>>> With the meetup and Anton's comment, here is the plan to move forward.
>>> We split the implementation into two phases. The minimum viable product
>>> (MVP) in phase 1 will have most things from the proposal with the
>>> following adjustments.
>>>
>>> *Phase 1 (MVP)*
>>>
>>> 1. To emit delete and insert CDC records only.
>>> 2. Don’t join for equality deletes. To emit equality deletes directly
>>>    as deleted rows per Anton’s suggestion. Otherwise, we need to join
>>>    the whole table with the equality delete files, which is not
>>>    scalable. We will evaluate the cost of the join in phase 2 and
>>>    probably support it, or find another way to approach it.
>>> 3.
>>>    COW: to output all rows in the deleted data files as the deleted
>>>    rows, and all rows in the added data files as the inserted rows. We
>>>    will figure out a more scalable way to filter out unchanged rows in
>>>    phase 2. The approach of joining on all columns has two issues:
>>>    1. Not scalable; think about a table with more than 100 columns.
>>>    2. Cannot handle duplicate records, e.g. (1, Amy, 20) was in the
>>>       data files marked as deleted, then we got new data files with
>>>       two identical rows (1, Amy, 20) and (1, Amy, 20).
>>> 4. User interface: to create an action to generate CDC records instead
>>>    of a procedure. An action can return a dataframe, which is more
>>>    convenient than an array of InternalRow produced by a Spark
>>>    procedure.
>>>
>>> *Phase 2*
>>>
>>> 1. Enable update reconstruction to emit CDC update records.
>>> 2. COW: to filter out unchanged rows.
>>> 3. User interface: to support the metatable, which will enable more use
>>>    cases, e.g., the streaming use case.
>>>
>>> Best,
>>>
>>> Yufei
>>>
>>> `This is not a contribution`
>>>
>>>
>>> On Mon, Mar 7, 2022 at 1:30 PM Anton Okolnychyi
>>> <aokolnyc...@apple.com.invalid> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> Based on Yufei’s design doc and what we discussed during the sync, I
>>>> shared my thoughts on what can be efficiently supported right now.
>>>>
>>>> https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554
>>>>
>>>> I’d be interested to learn more about specific use cases that would
>>>> violate the assumptions I listed in my comment. If you have such a use
>>>> case in mind, please comment on the issue.
>>>>
>>>> - Anton
>>>>
>>>>
>>>> On 24 Feb 2022, at 14:57, Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> We are moving the CDC design discussion to Friday of next week (Mar 4),
>>>> 9-10am PST, due to an unexpected event. The meeting link will be the
>>>> same, meet.google.com/vam-cmfx-feo.
>>>> Thanks!
>>>>
>>>> Best,
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Tue, Feb 22, 2022 at 12:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> It's great to see a lot of interest in the design.
>>>>> We are planning to have a meeting to discuss the Iceberg CDC design on
>>>>> Friday (2/25), 9-10am PST. The meeting link is
>>>>> meet.google.com/vam-cmfx-feo. We will talk about the general idea, as
>>>>> well as open questions. The meeting will be recorded.
>>>>>
>>>>>
>>>>> Best,
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Fri, Feb 11, 2022 at 3:54 PM Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Oh cool, I have not had a chance to review much of this, but I was
>>>>>> having a conversation with a team which wanted similar features for
>>>>>> a table, so excited to see folks working on it 👍
>>>>>>
>>>>>> On Fri, Feb 11, 2022 at 12:40 PM Yufei Gu <flyrain...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi team,
>>>>>>>
>>>>>>> We propose a way to generate the CDC records from the Iceberg
>>>>>>> tables. It is an approach without table spec change and write-time
>>>>>>> logging. It will cover the majority of CDC use cases, but is not
>>>>>>> guaranteed to cover all of them. We believe it's a good starting
>>>>>>> point to approach CDC in Iceberg. Any feedback is welcome!
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1bN6rdLNcYOHnT3xVBfB33BoiPO06aKBo56SZmuU9pnY/edit?usp=sharing
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
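
For readers following the thread, the delete semantics discussed above can be illustrated with a small sketch. The Mar 29 message resolves eq deletes to the actual deleted rows, and the Mar 9 message treats every row of a removed COW data file as a delete and every row of an added file as an insert. This is not Iceberg code: the function names, the dict-based rows, and the `DELETE`/`INSERT` labels are hypothetical, and the sketch only models the described behavior on in-memory lists.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def resolve_eq_deletes(
    data_rows: Sequence[Row],
    eq_deletes: Sequence[Row],
    key_fields: Sequence[str],
) -> List[Row]:
    """Return the actual data rows matched by any equality delete.

    Hypothetical helper: each eq delete carries only the key fields, and
    every data row whose key matches is emitted as a full deleted row,
    flagged with `_deleted` (modeled here as a plain dict entry).
    """
    delete_keys = {tuple(d[f] for f in key_fields) for d in eq_deletes}
    return [
        dict(row, _deleted=True)
        for row in data_rows
        if tuple(row[f] for f in key_fields) in delete_keys
    ]


def cow_changes(
    removed_file_rows: Sequence[Row],
    added_file_rows: Sequence[Row],
) -> List[Tuple[str, Row]]:
    """COW handling as described for the MVP: no join, no filtering.

    All rows of removed data files become deletes and all rows of added
    data files become inserts, so duplicates are preserved as-is.
    """
    deletes = [("DELETE", row) for row in removed_file_rows]
    inserts = [("INSERT", row) for row in added_file_rows]
    return deletes + inserts


# One eq delete on id=1 matches two data rows, so two deleted rows come out.
data = [
    {"id": 1, "name": "Amy", "age": 20},
    {"id": 1, "name": "Amy", "age": 21},
    {"id": 2, "name": "Bob", "age": 30},
]
deleted = resolve_eq_deletes(data, [{"id": 1}], ["id"])

# The duplicate case from the thread: one removed row, two identical added rows.
amy = {"id": 1, "name": "Amy", "age": 20}
changes = cow_changes([amy], [dict(amy), dict(amy)])
```

In the duplicate case, the sketch emits one delete and two inserts without trying to pair them up, which is why the whole-file approach does not give wrong results when identical rows exist. Filtering out unchanged rows is left to phase 2 in the thread and is not modeled here.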