[
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108270#comment-17108270
]
Etienne Chauchot commented on FLINK-17073:
------------------------------------------
Hi, I just started contributing to Flink. I'd like to take a look at this
issue, as it could be a good introduction to the Flink architecture IMHO. Since
the temporary workaround discussed above has been implemented
[here|https://github.com/apache/flink/pull/11957], maybe it is time to tackle
the issue itself. One thing I wonder: if we want to limit the number of
CompletedCheckpoints submitted to the IOExecutor for cleaning, what happens if
_ZooKeeperCompletedCheckpointStore_ tries to submit a new CompletedCheckpoint
when the limit has already been reached? Should it delay the submission until
the number of submitted tasks decreases?
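
To make the question concrete, here is a minimal sketch of the two behaviours at the limit: block the caller until a slot frees up, or refuse the submission and let the caller decide. It is plain java.util.concurrent code, not Flink code. The class {{BoundedCleanupSubmitter}}, its method names and the limit parameter are all hypothetical, and a {{Runnable}} stands in for the discard of one CompletedCheckpoint.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Flink): caps queued plus running cleanup tasks. */
final class BoundedCleanupSubmitter {
    private final ExecutorService ioExecutor;
    private final Semaphore slots;

    BoundedCleanupSubmitter(ExecutorService ioExecutor, int maxPendingCleanups) {
        this.ioExecutor = ioExecutor;
        this.slots = new Semaphore(maxPendingCleanups);
    }

    /** Option 1: delay the submission until the number of pending tasks drops. */
    void submitBlocking(Runnable cleanup) throws InterruptedException {
        slots.acquire();
        hand(cleanup);
    }

    /** Option 2: return false immediately when the limit is already reached. */
    boolean trySubmit(Runnable cleanup) {
        if (!slots.tryAcquire()) {
            return false;
        }
        hand(cleanup);
        return true;
    }

    /** Number of further submissions that would currently be accepted. */
    int availableSlots() {
        return slots.availablePermits();
    }

    // Caller must already hold a permit; it is released once the task has finished.
    private void hand(Runnable cleanup) {
        try {
            ioExecutor.execute(() -> {
                try {
                    cleanup.run();
                } finally {
                    slots.release();
                }
            });
        } catch (RejectedExecutionException e) {
            slots.release(); // executor refused the task, so free the slot again
            throw e;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two worker threads and at most three pending cleanups (arbitrary demo values).
        ExecutorService pool = Executors.newFixedThreadPool(2);
        BoundedCleanupSubmitter submitter = new BoundedCleanupSubmitter(pool, 3);
        for (int i = 0; i < 10; i++) {
            final int id = i;
            // Blocks here whenever three cleanups are already queued or running.
            submitter.submitBlocking(() -> System.out.println("discarded checkpoint " + id));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("free slots: " + submitter.availableSlots());
    }
}
```

With the blocking variant the memory held by pending cleanups stays bounded, but the submitting thread stalls while the limit is reached. With the non-blocking variant the caller has to decide what to do with the refused checkpoint. Which of the two fits the checkpoint store is exactly the open question.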
> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
> Reporter: Till Rohrmann
> Priority: Major
> Fix For: 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as
> CPU cores. This change might have caused the decline in completed checkpoint
> discard throughput. This suspicion needs to be validated before trying to fix
> it!
> [1]
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E