Hey Gustavo, Xuyang,
I tried incorporating your suggestions into the FLIP. Please take another
look.
Best,
Dawid

On Fri, 12 Dec 2025 at 16:05, Dawid Wysakowicz <[email protected]>
wrote:

> 1. The default behavior changes if no ON CONFLICT is defined. I am a
>> little concerned that this may cause errors in a large number of existing
>> cases.
>
> I can be convinced to leave the default behaviour as it is now. I am
> worried, though, that the current behaviour of the SinkUpsertMaterializer
> (SUM) is very rarely what people actually want. As mentioned in the FLIP,
> I wholeheartedly believe there are very few, if any, real-world scenarios
> where you need the deduplication behaviour. I'll elaborate a bit more in
> 2).
>
> 2. Regarding On Conflict Errors, in the context of CDC streams, it is
>> expected that the vast majority of cases cannot generate only one record
>> with one primary key. The only solutions I can think of are append-only
>> top1, deduplication, or aggregating the first row.
>
> I disagree with that statement. I don't think CDC streams change anything
> in that regard. Maybe there is some misunderstanding about what "one
> record" means in this context.
>
> I agree that there will almost certainly be a sequence of UB/UA messages
> for a single sink primary key.
>
> My claim is that users almost never want a situation where they have more
> than one "active" upsert key/record for one sink's primary key. I tried to
> explain that in the FLIP, but let me try to give one more example here.
>
> Imagine two tables:
> CREATE TABLE source (
>   id BIGINT,
>   name STRING,
>   `value` STRING,
>   PRIMARY KEY (id) NOT ENFORCED
> );
>
> CREATE TABLE sink (
>   name STRING,
>   `value` STRING,
>   PRIMARY KEY (name) NOT ENFORCED
> );
>
> INSERT INTO sink SELECT name, `value` FROM source;
>
> === Input
> (1, "Apple", "ABC")
> (2, "Apple", "DEF")
>
> In the scenario above a SUM is inserted, which will deduplicate the rows
> and override the value for "Apple" with "DEF". In my opinion that's
> entirely wrong; instead, an exception should be thrown, because there is
> actually a constraint violation.
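>
> To make it concrete, this is roughly the changelog the sink would observe
> today (a sketch, assuming the usual changelog encoding; the exact RowKinds
> may differ):
>
> +I ("Apple", "ABC")  -- row for id=1
> +U ("Apple", "DEF")  -- row for id=2 silently overrides id=1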
>
> I am more than happy to be proved wrong. If you do have a real-world
> scenario where the deduplication logic is actually correct and expected,
> please do share it. So far I have not seen one, nor was I able to come up
> with one. That said, I am not suggesting we remove the deduplication
> logic entirely; users can still opt into it with ON CONFLICT DEDUPLICATE.
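>
> For illustration only, here is a sketch of how opting into today's
> behaviour could be spelled out explicitly:
>
> CREATE TABLE sink (
>   name STRING,
>   `value` STRING,
>   -- placing the clause on the PRIMARY KEY constraint is my assumption
>   -- here, see the FLIP for the actual grammar
>   PRIMARY KEY (name) NOT ENFORCED ON CONFLICT DEDUPLICATE
> );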
>
> 3. The special watermark generation interval affects the visibility of
>> results. How can users configure this generation interval?
>
>
> That's a fair question; I'll try to elaborate on it in the FLIP. I can see
> two options:
> 1. We piggyback on existing watermarks in the query (sketched below); if
> there are no watermarks (i.e. the tables don't have a watermark
> definition), we fail during planning.
> 2. We add a new configuration option for a specialized, generalized
> watermark.
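>
> As a sketch of option 1, the compaction would simply reuse a regular
> watermark declared on the source table, e.g. (standard Flink DDL, the
> timestamp column is made up for this example):
>
> CREATE TABLE source (
>   id BIGINT,
>   name STRING,
>   `value` STRING,
>   ts TIMESTAMP(3),
>   WATERMARK FOR ts AS ts - INTERVAL '5' SECOND,
>   PRIMARY KEY (id) NOT ENFORCED
> );
>
> If no such definition exists anywhere in the query, planning would fail.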
>
> Let me think on that some more and I'll come back with a more concrete
> proposal.
>
>
>> 4. I believe that resolving out-of-order issues and addressing internal
>> consistency are two separate problems. As I understand the current
>> solution, it does not really resolve the internal consistency issue. We
>> could first resolve the out-of-order problem. For most scenarios that
>> require real-time response, we can directly output intermediate results
>> promptly.
>
>
> Why doesn't it solve it? It does. Given a UB/UA pair, we won't emit the
> intermediate state after processing the UB (see the example below).
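>
> For example, an update that changes "Apple" from "ABC" to "XYZ" arrives as
> the pair (-U/+U standing for UPDATE_BEFORE/UPDATE_AFTER):
>
> -U ("Apple", "ABC")
> +U ("Apple", "XYZ")
>
> As long as both fall between the same two watermarks, they are compacted
> into a single update and the sink never observes the retracted "ABC" state
> on its own.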
>
> 5. How can we compact data with the same custom watermark? If detailed
>> comparisons are necessary, I think we still need to preserve all key data;
>> we would just be compressing this data further at time t.
>
> Yes, we need to preserve all key data, but only between two watermarks.
> Assuming frequent watermarks, that's for a very short time (illustrated
> below).
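>
> A hypothetical timeline of what the operator buffers (W(t) marks a
> watermark, the records are taken from the example above):
>
> W(t1)
>   -U ("Apple", "ABC")  -- buffered, keyed by the sink primary key
>   +U ("Apple", "XYZ")  -- buffered, compacts with the -U above
> W(t2)                  -- emit the compacted update, drop the buffer
>
> So the per-key state only needs to live for the [t1, t2) interval.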
>
>> 6. If neither this proposed solution nor the rejected solution can resolve
>> internal consistency, we need to reconsider the differences between the two
>> approaches.
>
> I'll copy the explanation of why the alternative was rejected from the
> FLIP:
>
> The solution can help us solve the changelog disorder problem, but it
> does not help with the *internal consistency* issue. If we want to fix
> that as well, we still need the compaction on watermarks. At the same
> time, it increases the size of all flowing records. Therefore it was
> rejected in favour of simply compacting all records once on the
> progression of watermarks.
>
> Best,
> Dawid
>
