[
https://issues.apache.org/jira/browse/FLINK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martijn Visser updated FLINK-32554:
-----------------------------------
Component/s: API / Core
Connectors / Common
> Facilitate slot isolation and resource management for global committer
> ----------------------------------------------------------------------
>
> Key: FLINK-32554
> URL: https://issues.apache.org/jira/browse/FLINK-32554
> Project: Flink
> Issue Type: Improvement
> Components: API / Core, Connectors / Common
> Affects Versions: 1.16.2
> Reporter: Allen Wang
> Priority: Major
>
> Flink's global committer executes unique workload compared to the source and
> sink operators. In some use cases, it may require much higher amount of
> resources (CPU, memory) than other operators. However, according to this
> [source
> code|https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/connector/sink2/StandardSinkTopologies.java],
> currently it is not possible to isolate the global committer to a dedicated
> task manager or task slot, or assign more resources to it by leveraging the
> fine grained resource management. Flink would always make the global
> committer task share with another task in a task slot. (In one test, we tried
> to have one more task slot than required by the source/sink parallelism, but
> Flink still assigns the global committer to share a slot with another task.)
> As a result, we often see CPU utilization spike on the task manger that runs
> the global committer compared with other task managers and becomes the
> bottleneck for the job. Due to slot sharing and inadequate resources on the
> global committer, the job takes long time to initialize upon restarting and
> the checkpoints take long time to complete. Our job consumes from Kafka and
> this bottleneck causes significant increase of consumer lag. The lag in turn
> causes the Kafka source operator to replay backlogs, causing more CPU
> consumption on the source operator and making it worse for the global
> committer that runs in the same task slot.
> At minimum, we want the capability to configure the global committer to run
> in its own task slot, and make that work under reactive scaling. It would
> also be great to make the fine grained resource management working for global
> committer.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)