Re: Re: Re: FLIP-510: Drop ChangelogNormalize for operations which don't need it

Dawid Wysakowicz Fri, 07 Mar 2025 03:57:47 -0800

>
> From my understanding, for a sink, if its schema includes a primary key,
> we can assume it has
> the ability to process delete messages (with '-D') and perform deletions
> by key (PK). If it does not
> include a PK, we would implicitly treat it as a log-structured table that
> supports full row deletions.



I am afraid this assumption is too far going. PK is information about
columns uniqueness and that's it. It does not tell us what is required to
perform a DELETE operation. I agree the assumption would most often hold,
but I am afraid it is not guaranteed. E.g. In a log based systems one may
just want to have full information encoded in the DELETE messages. (e.g. in
a debezium message)

Same holds for sources. Even though theoretically, if there is a PK,
deletes could contain only the key information, but the source may just as
well produce DELETEs with all fields set.

Given that you mentioned `PARTIAL_DELETE`, should I interpret this as
> referring to a scenario
> similar to wide tables, where if the sink has a PK, some columns are
> deleted (set to null or through other operations) while others remain
> unchanged?


No. The effect is the same. That the ROW is deleted/disappears. The
difference is what is required to perform the deletion. In some cases it
may be enough to have the PK to perform the deletion and then we don't need
the information about other columns, but there may be systems that require
all columns to be set.

By the way, since the flag applies both for sources and sinks to tell what
is the expected format of DELETE records produced/consumed I renamed the
flag in the FLIP:
supportsDeleteByKey -> deletesByKeOnly.

Let me know if there are other questions. If there are none, I'd like to
start a vote in the upcoming days.

Best,
Dawid


On Mon, 3 Mar 2025 at 07:29, Xuyang <[email protected]> wrote:

> Hi, Dawid.
>
> Thanks for your response. I believe I've identified a key point, but I’m a
> bit unclear about the
>
> following you said. Could you please provide an example for clarification?
>
> ```
>
> The only missing information is if the external sink can consume deletes
> by key and if a source
>
> produces full deletes or deletes by key.
>
> ```
>
> From my understanding, for a sink, if its schema includes a primary key,
> we can assume it has
>
> the ability to process delete messages (with '-D') and perform deletions
> by key (PK). If it does not
>
> include a PK, we would implicitly treat it as a log-structured table that
> supports full row deletions.
>
> Given that you mentioned `PARTIAL_DELETE`, should I interpret this as
> referring to a scenario
>
> similar to wide tables, where if the sink has a PK, some columns are
> deleted (set to null or through
>
> other operations) while others remain unchanged?
>
> Looking forward your reply.
>
>
>
>
>
>
>
> --
>
>     Best！
>     Xuyang
>
>
>
>
>
> At 2025-02-28 19:16:12, "Dawid Wysakowicz" <[email protected]>
> wrote:
> >Hey Xuyang,
> >Ad. 1
> >Yes, you're right, but we already do that for determining if we need
> >UPDATE_BEFORE or not. FlinkChangelogModeInferenceProgram already deals
> with
> >that.
> >Ad. 2
> >Unfortunately it is. This is also the only reason I need a FLIP. We can
> >determine internally for every internal operator if we can work with
> >partial deletes or if we need full deletes. The only missing information
> is
> >if the external sink can consume deletes by key and if a source produces
> >full deletes or deletes by key. Unfortunately this is information that
> >comes from a connector implementation and thus needs to be provided via a
> >public API.
> >Ad. 3
> >With ChangelogMode#kinds -> to some degree yes. We theoretically could
> >split RowKind#DELETE to RowKind#DELETE_BY_KEY and RowKind#FULL_DELETE.
> >However, that change would 1) be much more involved 2) we would need to
> >encode that information in every single message, which I think is not
> >necessary. I don't think it has much to do with PK.
> >Ad.4
> >I don't think so. PK information is part of Schema not about the kind of
> >messages. We don't have PK information for UPDATE_BEFORE/UPDATE_AFTER and
> >they also apply per key. If the name containing `DELETE_BY_KEY` is
> >confusing I am happy to rename it to e.g. PARTIAL_DELETE, therefore I'd
> add
> >`supportsPartialDeletes`
> >
> >Best,
> >Dawid
> >
> >On Fri, 28 Feb 2025 at 04:43, Xuyang <[email protected]> wrote:
> >
> >> Hi Dawid.
> >>
> >>
> >>
> >>
> >> Big +1 for this FLIP. After reading through it, I have a few questions
> and
> >> would appreciate your responses:
> >>
> >> 1. IIUC, we only need to provide additional information in the
> >> `FlinkChangelogModeInferenceProgram` to enable the
> >>
> >> inference program to determine whether it is safe to remove
> >> `ChangelogNormalize`. My first instinct is that we need to
> >>
> >> know if all subsequent output-side nodes consuming Upsert Keys include
> the
> >> Upsert Keys provided by the input-side operator (source).
> >>
> >> If this condition is met, we can safely eliminate `ChangelogNormalize`.
> >> Perhaps, I have missed some important points, so please feel
> >>
> >> free to correct me if necessary.
> >>
> >> 2. The introduction of `supportsDeleteByKey` in ChangelogMode seems to
> >> exist solely as auxiliary information for the
> >>
> >> `FlinkChangelogModeInferenceProgram`. If that's the case, it doesn't
> seem
> >> necessary to expose it in the public API, does it?
> >>
> >> 3. If the purpose of introducing `supportsDeleteByKey` in ChangelogMode
> is
> >> to facilitate support for `#fromChangelogStream`
> >>
> >> and `#toChangelogStream`, it appears that `supportsDeleteByKey` might
> >> overlap with ChangelogMode#kinds and Schema#PK
> >>
> >> to some extent, right?
> >>
> >> 4. Regarding supportsDeleteByKey, as part of a complete ChangelogMode
> >> entity, should we also store the specific key information?
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >>
> >>     Best！
> >>     Xuyang
> >>
> >>
> >>
> >>
> >>
> >> 在 2025-02-28 04:27:19，"Martijn Visser" <[email protected]> 写道：
> >> >Hi Dawid,
> >> >
> >> >Thanks for the FLIP, looks like a good improvement for me that will
> bring
> >> a
> >> >lot of benefits. +1
> >> >
> >> >Best regards,
> >> >
> >> >Martijn
> >> >
> >> >On Tue, Feb 25, 2025 at 6:51 AM Sergey Nuyanzin <[email protected]>
> >> wrote:
> >> >
> >> >> +1 for such improvement
> >> >>
> >> >> On Mon, Feb 24, 2025 at 12:01 PM Dawid Wysakowicz
> >> >> <[email protected]> wrote:
> >> >> >
> >> >> > Hi everyone,
> >> >> >
> >> >> > I would like to initiate a discussion for the FLIP-510[1] below,
> which
> >> >> aims
> >> >> > on optimising certain use cases in SQL which at the moment add
> >> >> > ChangelogNormalize, but don't necessarily need to do it.
> >> >> >
> >> >> > Looking forward to hearing from you.
> >> >> >
> >> >> > [1] https://cwiki.apache.org/confluence/x/7o5EF
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best regards,
> >> >> Sergey
> >> >>
> >>
>

Re: Re: Re: FLIP-510: Drop ChangelogNormalize for operations which don't need it

Reply via email to