From my understanding, for a sink, if its schema includes a primary key,
we can assume it has
the ability to process delete messages (with '-D') and perform deletions
by key (PK). If it does not
include a PK, we would implicitly treat it as a log-structured table that
supports full row deletions.
I am afraid this assumption goes too far. A PK is information about
column uniqueness and nothing more. It does not tell us what is required
to perform a DELETE operation. I agree the assumption would most often
hold, but it is not guaranteed. E.g. in a log-based system one may just
want to have the full row encoded in the DELETE messages (e.g. in a
Debezium message).
The same holds for sources. Even if there is a PK and deletes could
theoretically contain only the key information, the source may just as
well produce DELETEs with all fields set.
Given that you mentioned `PARTIAL_DELETE`, should I interpret this as
referring to a scenario
similar to wide tables, where if the sink has a PK, some columns are
deleted (set to null or through other operations) while others remain
unchanged?
No. The effect is the same: the ROW is deleted/disappears. The
difference is what is required to perform the deletion. In some cases
the PK may be enough, and then we don't need the information about the
other columns, but there may be systems that require all columns to be
set.
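To make the distinction concrete, here is a small illustrative sketch (plain Python, not the actual Flink classes) contrasting the two DELETE shapes a changelog stream might carry, under an assumed schema users(id PRIMARY KEY, name, email):

```python
# Illustrative model of changelog DELETE records (not Flink API).
# Assumed schema: users(id PRIMARY KEY, name, email).

# A "delete by key" record: only the PK column is populated; non-key
# columns are absent. A sink must be able to locate the row by key
# alone to apply this.
delete_by_key = {"kind": "-D", "id": 42, "name": None, "email": None}

# A "full delete" record: every column carries the value of the deleted
# row, e.g. as a Debezium 'before' image would. A sink that needs all
# fields of the row (no key-based lookup) requires this shape.
full_delete = {"kind": "-D", "id": 42, "name": "alice", "email": "a@x.io"}

def can_apply(record, sink_requires_full_row):
    """A sink that requires full rows cannot consume key-only deletes."""
    has_full_row = all(v is not None for v in record.values())
    return has_full_row or not sink_requires_full_row

assert can_apply(full_delete, sink_requires_full_row=True)
assert not can_apply(delete_by_key, sink_requires_full_row=True)
assert can_apply(delete_by_key, sink_requires_full_row=False)
```

In both cases the effect on the table is identical (the row with id=42 disappears); only the information the sink needs to carry out the deletion differs.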
By the way, since the flag applies to both sources and sinks to
describe the expected format of the DELETE records produced/consumed, I
renamed the flag in the FLIP:
supportsDeleteByKey -> deletesByKeyOnly.
Let me know if there are other questions. If there are none, I'd like to
start a vote in the upcoming days.
Best,
Dawid
On Mon, 3 Mar 2025 at 07:29, Xuyang <xyzhong...@163.com> wrote:
Hi, Dawid.
Thanks for your response. I believe I've identified a key point, but
I'm a bit unclear about the following statement of yours.
Could you please provide an example for clarification?
```
The only missing information is if the external sink can consume deletes
by key and if a source
produces full deletes or deletes by key.
```
From my understanding, for a sink, if its schema includes a primary key,
we can assume it has
the ability to process delete messages (with '-D') and perform deletions
by key (PK). If it does not
include a PK, we would implicitly treat it as a log-structured table that
supports full row deletions.
Given that you mentioned `PARTIAL_DELETE`, should I interpret this as
referring to a scenario
similar to wide tables, where if the sink has a PK, some columns are
deleted (set to null or through
other operations) while others remain unchanged?
Looking forward to your reply.
--
Best!
Xuyang
At 2025-02-28 19:16:12, "Dawid Wysakowicz" <wysakowicz.da...@gmail.com>
wrote:
Hey Xuyang,
Ad. 1
Yes, you're right, but we already do that for determining if we need
UPDATE_BEFORE or not. FlinkChangelogModeInferenceProgram already deals
with that.
Ad. 2
Unfortunately it is. This is also the only reason I need a FLIP. We can
determine internally for every internal operator whether it can work
with partial deletes or whether it needs full deletes. The only missing
information is whether the external sink can consume deletes by key and
whether a source produces full deletes or deletes by key. Unfortunately,
this information comes from a connector implementation and thus needs to
be provided via a public API.
Ad. 3
With ChangelogMode#kinds -> to some degree yes. We theoretically could
split RowKind#DELETE into RowKind#DELETE_BY_KEY and RowKind#FULL_DELETE.
However, that change would 1) be much more involved and 2) require
encoding that information in every single message, which I think is
unnecessary. I don't think it has much to do with the PK.
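The trade-off in Ad. 3 can be sketched roughly as follows (illustrative Python, not the actual Flink types; the class and field names are placeholders for the FLIP's proposal): encoding the distinction per message means every record must carry it, whereas a single flag on the changelog mode states it once for the whole stream.

```python
# Option A (rejected): split the DELETE row kind in two, so every
# single record must encode whether it is a key-only or a full delete.
records_option_a = [
    ("DELETE_BY_KEY", {"id": 1}),
    ("FULL_DELETE", {"id": 2, "name": "bob"}),
]

# Option B (the FLIP's direction, sketched): the per-stream changelog
# mode carries one boolean and all DELETE records in the stream share
# the same shape. Class/field names here are illustrative only.
class ChangelogModeSketch:
    def __init__(self, kinds, deletes_by_key_only=False):
        self.kinds = kinds
        self.deletes_by_key_only = deletes_by_key_only

upsert_source_mode = ChangelogModeSketch(
    kinds={"INSERT", "UPDATE_AFTER", "DELETE"},
    deletes_by_key_only=True,  # deletes contain only the key columns
)

assert "DELETE" in upsert_source_mode.kinds
assert upsert_source_mode.deletes_by_key_only
```

With option B the planner can inspect a single per-stream property, and the record format itself stays unchanged.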
Ad. 4
I don't think so. PK information is part of the Schema, not of the kind
of messages. We don't have PK information for
UPDATE_BEFORE/UPDATE_AFTER, and they also apply per key. If the name
containing `DELETE_BY_KEY` is confusing, I am happy to rename it to e.g.
PARTIAL_DELETE, in which case I'd add `supportsPartialDeletes`.
Best,
Dawid
On Fri, 28 Feb 2025 at 04:43, Xuyang <xyzhong...@163.com> wrote:
Hi Dawid.
Big +1 for this FLIP. After reading through it, I have a few questions
and would appreciate your responses:
1. IIUC, we only need to provide additional information in the
`FlinkChangelogModeInferenceProgram` to enable the
inference program to determine whether it is safe to remove
`ChangelogNormalize`. My first instinct is that we need to
know if all subsequent output-side nodes consuming Upsert Keys include
the Upsert Keys provided by the input-side operator (source).
If this condition is met, we can safely eliminate `ChangelogNormalize`.
Perhaps I have missed some important points, so please feel
free to correct me if necessary.
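The condition described in point 1 can be sketched as a subset check (illustrative only; the real logic lives in FlinkChangelogModeInferenceProgram, and the function and parameter names below are hypothetical):

```python
def can_remove_changelog_normalize(source_upsert_keys, consumers):
    """Sketch: ChangelogNormalize can be dropped only if every
    downstream consumer works with the keys the source provides and
    can accept key-only deletes. Each consumer is modeled as a pair
    (required_keys, accepts_key_only_deletes)."""
    return all(
        set(required_keys) <= set(source_upsert_keys)
        and accepts_key_only_deletes
        for required_keys, accepts_key_only_deletes in consumers
    )

# A sink keyed on ("id",) that accepts key-only deletes: safe to remove.
assert can_remove_changelog_normalize(("id",), [(("id",), True)])
# A sink that needs full rows for deletes: normalization still required.
assert not can_remove_changelog_normalize(("id",), [(("id",), False)])
```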
2. The introduction of `supportsDeleteByKey` in ChangelogMode seems to
exist solely as auxiliary information for the
`FlinkChangelogModeInferenceProgram`. If that's the case, it doesn't
seem necessary to expose it in the public API, does it?
3. If the purpose of introducing `supportsDeleteByKey` in ChangelogMode
is to facilitate support for `#fromChangelogStream`
and `#toChangelogStream`, it appears that `supportsDeleteByKey` might
overlap with ChangelogMode#kinds and Schema#PK
to some extent, right?
4. Regarding supportsDeleteByKey, as part of a complete ChangelogMode
entity, should we also store the specific key information?
--
Best!
Xuyang
At 2025-02-28 04:27:19, "Martijn Visser" <martijnvis...@apache.org> wrote:
Hi Dawid,
Thanks for the FLIP, this looks like a good improvement to me that will
bring a lot of benefits. +1
Best regards,
Martijn
On Tue, Feb 25, 2025 at 6:51 AM Sergey Nuyanzin <snuyan...@gmail.com>
wrote:
+1 for such improvement
On Mon, Feb 24, 2025 at 12:01 PM Dawid Wysakowicz
<wysakowicz.da...@gmail.com> wrote:
Hi everyone,
I would like to initiate a discussion for FLIP-510 [1] below, which
aims at optimising certain use cases in SQL which currently add a
ChangelogNormalize but don't necessarily need it.
Looking forward to hearing from you.
[1] https://cwiki.apache.org/confluence/x/7o5EF
--
Best regards,
Sergey