From my understanding, for a sink, if its schema includes a primary key,
we can assume it has
the ability to process delete messages (with '-D') and perform deletions
by key (PK). If it does not
include a PK, we would implicitly treat it as a log-structured table that
supports full row deletions.
I am afraid this assumption goes too far. A PK is information about
column uniqueness and nothing more. It does not tell us what is required
to perform a DELETE operation. I agree the assumption would most often
hold, but it is not guaranteed. E.g. in a log-based system one may just
want to have the full row encoded in the DELETE messages (e.g. in a
Debezium message).
The same holds for sources. Even if there is a PK and deletes could
theoretically contain only the key information, the source may just as
well produce DELETEs with all fields set.
Given that you mentioned `PARTIAL_DELETE`, should I interpret this as
referring to a scenario
similar to wide tables, where if the sink has a PK, some columns are
deleted (set to null or through other operations) while others remain
unchanged?
No. The effect is the same: the ROW is deleted/disappears. The
difference is what is required to perform the deletion. In some cases
the PK may be enough, and then we don't need the information about the
other columns, but there may be systems that require all columns to be
set.
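To make the distinction concrete, here is a small illustrative sketch (plain Python, not the actual Flink classes) contrasting the two DELETE shapes a changelog stream might carry, under an assumed schema users(id PRIMARY KEY, name, email):

```python
# Illustrative model of changelog DELETE records (not Flink API).
# Assumed schema: users(id PRIMARY KEY, name, email).

# A "delete by key" record: only the PK column is populated; non-key
# columns are absent. A sink must be able to locate the row by key
# alone to apply this.
delete_by_key = {"kind": "-D", "id": 42, "name": None, "email": None}

# A "full delete" record: every column carries the value of the deleted
# row, e.g. as a Debezium 'before' image would. A sink that needs all
# fields of the row (no key-based lookup) requires this shape.
full_delete = {"kind": "-D", "id": 42, "name": "alice", "email": "a@x.io"}

def can_apply(record, sink_requires_full_row):
    """A sink that requires full rows cannot consume key-only deletes."""
    has_full_row = all(v is not None for v in record.values())
    return has_full_row or not sink_requires_full_row

assert can_apply(full_delete, sink_requires_full_row=True)
assert not can_apply(delete_by_key, sink_requires_full_row=True)
assert can_apply(delete_by_key, sink_requires_full_row=False)
```

In both cases the effect on the table is identical (the row with id=42 disappears); only the information the sink needs to carry out the deletion differs.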
By the way, since the flag applies to both sources and sinks to
describe the expected format of the DELETE records produced/consumed, I
renamed the flag in the FLIP:
supportsDeleteByKey -> deletesByKeyOnly.
Let me know if there are other questions. If there are none, I'd like to
start a vote in the upcoming days.
Best,
Dawid
On Mon, 3 Mar 2025 at 07:29, Xuyang <xyzhong...@163.com> wrote:
Hi, Dawid.
Thanks for your response. I believe I've identified a key point, but
I'm a bit unclear about the following statement of yours.
Could you please provide an example for clarification?
```
The only missing information is if the external sink can consume deletes
by key and if a source
produces full deletes or deletes by key.
```
From my understanding, for a sink, if its schema includes a primary key,
we can assume it has
the ability to process delete messages (with '-D') and perform deletions
by key (PK). If it does not
include a PK, we would implicitly treat it as a log-structured table that
supports full row deletions.
Given that you mentioned `PARTIAL_DELETE`, should I interpret this as
referring to a scenario
similar to wide tables, where if the sink has a PK, some columns are
deleted (set to null or through
other operations) while others remain unchanged?
Looking forward to your reply.
--
Best!
Xuyang
At 2025-02-28 19:16:12, "Dawid Wysakowicz" <wysakowicz.da...@gmail.com>
wrote:
Hey Xuyang,
Ad. 1
Yes, you're right, but we already do that for determining if we need
UPDATE_BEFORE or not. FlinkChangelogModeInferenceProgram already deals
with that.
Ad. 2
Unfortunately it is. This is also the only reason I need a FLIP. We can
determine internally for every internal operator whether it can work
with partial deletes or whether it needs full deletes. The only missing
information is whether the external sink can consume deletes by key and
whether a source produces full deletes or deletes by key. Unfortunately,
this information comes from a connector implementation and thus needs to
be provided via a public API.
Ad. 3
With ChangelogMode#kinds -> to some degree yes. We theoretically could
split RowKind#DELETE into RowKind#DELETE_BY_KEY and RowKind#FULL_DELETE.
However, that change would 1) be much more involved and 2) require
encoding that information in every single message, which I think is
unnecessary. I don't think it has much to do with the PK.
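The trade-off in Ad. 3 can be sketched roughly as follows (illustrative Python, not the actual Flink types; the class and field names are placeholders for the FLIP's proposal): encoding the distinction per message means every record must carry it, whereas a single flag on the changelog mode states it once for the whole stream.

```python
# Option A (rejected): split the DELETE row kind in two, so every
# single record must encode whether it is a key-only or a full delete.
records_option_a = [
    ("DELETE_BY_KEY", {"id": 1}),
    ("FULL_DELETE", {"id": 2, "name": "bob"}),
]

# Option B (the FLIP's direction, sketched): the per-stream changelog
# mode carries one boolean and all DELETE records in the stream share
# the same shape. Class/field names here are illustrative only.
class ChangelogModeSketch:
    def __init__(self, kinds, deletes_by_key_only=False):
        self.kinds = kinds
        self.deletes_by_key_only = deletes_by_key_only

upsert_source_mode = ChangelogModeSketch(
    kinds={"INSERT", "UPDATE_AFTER", "DELETE"},
    deletes_by_key_only=True,  # deletes contain only the key columns
)

assert "DELETE" in upsert_source_mode.kinds
assert upsert_source_mode.deletes_by_key_only
```

With option B the planner can inspect a single per-stream property, and the record format itself stays unchanged.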
Ad. 4
I don't think so. PK information is part of the Schema, not of the kind
of messages. We don't have PK information for
UPDATE_BEFORE/UPDATE_AFTER, and they also apply per key. If the name
containing `DELETE_BY_KEY` is confusing, I am happy to rename it to e.g.
PARTIAL_DELETE, in which case I'd add `supportsPartialDeletes`.
Best,
Dawid
On Fri, 28 Feb 2025 at 04:43, Xuyang <xyzhong...@163.com> wrote:
Hi Dawid.
Big +1 for this FLIP. After reading through it, I have a few questions
and would appreciate your responses:
1. IIUC, we only need to provide additional information in the
`FlinkChangelogModeInferenceProgram` to enable the
inference program to determine whether it is safe to remove
`ChangelogNormalize`. My first instinct is that we need to
know if all subsequent output-side nodes consuming Upsert Keys include
the Upsert Keys provided by the input-side operator (source).
If this condition is met, we can safely eliminate `ChangelogNormalize`.
Perhaps I have missed some important points, so please feel
free to correct me if necessary.
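The condition described in point 1 can be sketched as a subset check (illustrative only; the real logic lives in FlinkChangelogModeInferenceProgram, and the function and parameter names below are hypothetical):

```python
def can_remove_changelog_normalize(source_upsert_keys, consumers):
    """Sketch: ChangelogNormalize can be dropped only if every
    downstream consumer works with the keys the source provides and
    can accept key-only deletes. Each consumer is modeled as a pair
    (required_keys, accepts_key_only_deletes)."""
    return all(
        set(required_keys) <= set(source_upsert_keys)
        and accepts_key_only_deletes
        for required_keys, accepts_key_only_deletes in consumers
    )

# A sink keyed on ("id",) that accepts key-only deletes: safe to remove.
assert can_remove_changelog_normalize(("id",), [(("id",), True)])
# A sink that needs full rows for deletes: normalization still required.
assert not can_remove_changelog_normalize(("id",), [(("id",), False)])
```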
2. The introduction of `supportsDeleteByKey` in ChangelogMode seems to
exist solely as auxiliary information for the
`FlinkChangelogModeInferenceProgram`. If that's the case, it doesn't
seem necessary to expose it in the public API, does it?
3. If the purpose of introducing `supportsDeleteByKey` in ChangelogMode
is to facilitate support for `#fromChangelogStream`
and `#toChangelogStream`, it appears that `supportsDeleteByKey` might
overlap with ChangelogMode#kinds and Schema#PK
to some extent, right?
4. Regarding supportsDeleteByKey, as part of a complete ChangelogMode
entity, should we also store the specific key information?
--
Best!
Xuyang
At 2025-02-28 04:27:19, "Martijn Visser" <martijnvis...@apache.org> wrote:
Hi Dawid,
Thanks for the FLIP, this looks like a good improvement to me that will
bring a lot of benefits. +1
Best regards,
Martijn
On Tue, Feb 25, 2025 at 6:51 AM Sergey Nuyanzin <snuyan...@gmail.com>
wrote:
+1 for such improvement
On Mon, Feb 24, 2025 at 12:01 PM Dawid Wysakowicz
<wysakowicz.da...@gmail.com> wrote:
Hi everyone,
I would like to initiate a discussion for FLIP-510 [1] below, which
aims at optimising certain use cases in SQL which currently add a
ChangelogNormalize but don't necessarily need it.
Looking forward to hearing from you.
[1] https://cwiki.apache.org/confluence/x/7o5EF
--
Best regards,
Sergey