[ 
https://issues.apache.org/jira/browse/FLINK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaitian updated FLINK-36664:
----------------------------
            Attachment: image-2024-11-06-20-23-53-184.png
                        image-2024-11-06-20-20-34-562.png
              Language: Java
           Description: 
When setting the offset for the window, data is lost because the triggering 
window time is not aligned.

 

When SlicingWindowOperator processesWatermarker, it records the time when the 
next window is triggered (nextTriggerWatermark):

!image-2024-11-06-20-20-34-562.png!

The calculation method is to first calculate the begin time of the window where 
the watermark is located, but the offset passed in during the calculation is 0:

!image-2024-11-06-20-23-53-184.png!

That is to say, the window triggering time does not take into account the 
window offset.
It is OK if the nextTriggerWatermark is too small. When watermark-offset>0, if 
(watermark-offset)%window_size > watermark%window_size is satisfied, the 
nextTriggerWatermark will be too large, where offset and window_size are 
constants. If the watermark is completely random, it is easy to prove that 
there is a 50% probability that the nextTriggerWatermark will be too large.


When the nextTriggerWatermark is too large, a processWatermark should have 
flushWindowBuffer but was not triggered, resulting in less data in the 
currently triggered window (assuming it is key1). When the next 
processWatermark triggers flushWindowBuffer, since the Watermark has moved 
forward, key1 will be regarded as expired data and the timer will not be 
registered. That is to say, the subsequent processWatermark will no longer 
calculate key1, and data will be lost.

 

I have writen a UT to prove this bug:

window_size 3000, offset 1000, tumping window
window_size 3000, offset 1000. When processWatermark(3000), the normally 
calculated nextTriggerProgress = 3000 - (3000-1000)%3000 + 3000-1 = 3999, but 
because the code does not consider the offset, nextTriggerProgress = 3000 - 
(3000-0)%3000 + 3000-1 = 5999, which is too large.We will lose data.

!https://intranetproxy.alipay.com/skylark/lark/0/2024/png/101856358/1730893082766-4ea5d356-2b19-436d-be46-093f70e445cd.png!

This wrong UT will pass. You can see that the data in the UT is lost and will 
not be calculated in any subsequent processWatermark.

 

Repair suggestion:

pass window_offset when calculating nextTriggerWatermark

 

  was:
When setting the offset for the window, data is lost because the triggering 
window time is not aligned.

 

When SlicingWindowOperator processesWatermarker, it records the time when the 
next window is triggered (nextTriggerWatermark):

                Labels: window  (was: )
    Remaining Estimate: 1h
     Original Estimate: 1h

> [Window]The window with window_offset will lose data.
> -----------------------------------------------------
>
>                 Key: FLINK-36664
>                 URL: https://issues.apache.org/jira/browse/FLINK-36664
>             Project: Flink
>          Issue Type: Bug
>            Reporter: kaitian
>            Priority: Major
>              Labels: window
>         Attachments: image-2024-11-06-20-20-34-562.png, 
> image-2024-11-06-20-23-53-184.png
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When setting the offset for the window, data is lost because the triggering 
> window time is not aligned.
>  
> When SlicingWindowOperator processesWatermarker, it records the time when the 
> next window is triggered (nextTriggerWatermark):
> !image-2024-11-06-20-20-34-562.png!
> The calculation method is to first calculate the begin time of the window 
> where the watermark is located, but the offset passed in during the 
> calculation is 0:
> !image-2024-11-06-20-23-53-184.png!
> That is to say, the window triggering time does not take into account the 
> window offset.
> It is OK if the nextTriggerWatermark is too small. When watermark-offset>0, 
> if (watermark-offset)%window_size > watermark%window_size is satisfied, the 
> nextTriggerWatermark will be too large, where offset and window_size are 
> constants. If the watermark is completely random, it is easy to prove that 
> there is a 50% probability that the nextTriggerWatermark will be too large.
> When the nextTriggerWatermark is too large, a processWatermark should have 
> flushWindowBuffer but was not triggered, resulting in less data in the 
> currently triggered window (assuming it is key1). When the next 
> processWatermark triggers flushWindowBuffer, since the Watermark has moved 
> forward, key1 will be regarded as expired data and the timer will not be 
> registered. That is to say, the subsequent processWatermark will no longer 
> calculate key1, and data will be lost.
>  
> I have writen a UT to prove this bug:
> window_size 3000, offset 1000, tumping window
> window_size 3000, offset 1000. When processWatermark(3000), the normally 
> calculated nextTriggerProgress = 3000 - (3000-1000)%3000 + 3000-1 = 3999, but 
> because the code does not consider the offset, nextTriggerProgress = 3000 - 
> (3000-0)%3000 + 3000-1 = 5999, which is too large.We will lose data.
> !https://intranetproxy.alipay.com/skylark/lark/0/2024/png/101856358/1730893082766-4ea5d356-2b19-436d-be46-093f70e445cd.png!
> This wrong UT will pass. You can see that the data in the UT is lost and will 
> not be calculated in any subsequent processWatermark.
>  
> Repair suggestion:
> pass window_offset when calculating nextTriggerWatermark
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to