[ 
https://issues.apache.org/jira/browse/FLINK-37399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-37399:
-----------------------------------
    Labels: pull-request-available  (was: )

> Watermark alignment can prevent backlogged jobs from using all available 
> resources
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-37399
>                 URL: https://issues.apache.org/jira/browse/FLINK-37399
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / DataStream
>    Affects Versions: 2.0.0, 1.18.1, 1.19.2
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Major
>              Labels: pull-request-available
>
> Imagine the following scenario. Max allowed watermark alignment drift is set 
> to 30s and watermark alignment is synced/announced across subtasks every 1s.  
> This makes perfect sense during normal records processing.
> But when processing messages from backlog, it creates a problem, defacto 
> capping how quickly watermarks can be progressing. For example:
> * at {{t0}} we announce that max allowed watermark is {{max_w0 = 
> min(reported_watermark) + 30s}}
> * next announcement will happen at {{t1 = t0 + 1s}}
> * any source/partition that exceeds {{max_w0}} before the next announcement 
> happens at {{t1}}, will be blocked by the watermark alignment
> In other words, with the above configuration (30s allowed drift announced 
> every ~1s), we are de facto capping the backlog processing speed to:
> *30 "event time" seconds per each 1 "real world" second*
> The problem is the delay in how quickly watermark announcements are 
> propagating between operators and the coordinator. By the time operator 
> receives what is max allowed watermark, that value is already ~1s old. 
> Symptoms of this problem are:
> * non empty backlog/pending records
> * job is under utilised, with all subtasks being at least partially idle (for 
> example 50% idle)
> In other words, there are records to be processed from the backlog, job has 
> resources to process more records more quickly, but it doesn't because 
> watermark alignment is blocking the progress, constantly pausing/resuming all 
> of the splits. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to