[ 
https://issues.apache.org/jira/browse/FLINK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn Visser updated FLINK-32554:
-----------------------------------
    Component/s: API / Core
                 Connectors / Common

> Facilitate slot isolation and resource management for global committer
> ----------------------------------------------------------------------
>
>                 Key: FLINK-32554
>                 URL: https://issues.apache.org/jira/browse/FLINK-32554
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Core, Connectors / Common
>    Affects Versions: 1.16.2
>            Reporter: Allen Wang
>            Priority: Major
>
> Flink's global committer executes unique workload compared to the source and 
> sink operators. In some use cases, it may require much higher amount of 
> resources (CPU, memory) than other operators. However, according to this 
> [source 
> code|https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/connector/sink2/StandardSinkTopologies.java],
>  currently it is not possible to isolate the global committer to a dedicated 
> task manager or task slot, or assign more resources to it by leveraging the 
> fine grained resource management. Flink would always make the global 
> committer task share with another task in a task slot. (In one test, we tried 
> to have one more task slot than required by the source/sink parallelism, but 
> Flink still assigns the global committer to share a slot with another task.)
> As a result, we often see CPU utilization spike on the task manger that runs 
> the global committer compared with other task managers and becomes the 
> bottleneck for the job. Due to slot sharing and inadequate resources on the 
> global committer, the job takes long time to initialize upon restarting and 
> the checkpoints take long time to complete. Our job consumes from Kafka and 
> this bottleneck causes significant increase of consumer lag. The lag in turn 
> causes the Kafka source operator to replay backlogs, causing more CPU 
> consumption on the source operator and making it worse for the global 
> committer that runs in the same task slot.
> At minimum, we want the capability to configure the global committer to run 
> in its own task slot, and make that work under reactive scaling. It would 
> also be great to make the fine grained resource management working for global 
> committer.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to