[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084149#comment-17084149 ]
Yun Tang commented on FLINK-17073:
----------------------------------
I prefer the configuration solution, since it keeps the default behavior the
same as before. The current Flink architecture cannot fully prevent this
problem whenever checkpoints are created faster than previous checkpoints are
deleted. Increasing the pool size cannot prevent it either; it only makes the
problem less likely.
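
To illustrate why a larger pool only mitigates the backlog, here is a minimal
sketch, not Flink code. The class and method names (CleanupBacklogSketch,
queuedDiscards) are hypothetical, and both runs use a plain fixed-size
ThreadPoolExecutor with an unbounded queue, so the only difference between
them is the thread count. Each "discard" task blocks on a latch to stand in
for slow storage deletes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CleanupBacklogSketch {

    /**
     * Submits pendingDiscards blocked "discard" tasks to a fixed pool of
     * poolSize threads and returns how many of them are left waiting in the
     * executor's unbounded queue (where they occupy heap memory).
     */
    public static int queuedDiscards(int poolSize, int pendingDiscards)
            throws InterruptedException {
        // Same construction that Executors.newFixedThreadPool(poolSize) uses.
        ThreadPoolExecutor ioExecutor = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        // Assumption: storage is so slow that no discard finishes while we submit.
        CountDownLatch slowStorage = new CountDownLatch(1);
        try {
            for (int i = 0; i < pendingDiscards; i++) {
                ioExecutor.execute(() -> {
                    try {
                        slowStorage.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Every task beyond the first poolSize ones is parked in the queue.
            return ioExecutor.getQueue().size();
        } finally {
            // Release the blocked tasks and shut the pool down cleanly.
            slowStorage.countDown();
            ioExecutor.shutdown();
            ioExecutor.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("4 threads:  " + queuedDiscards(4, 1000));
        System.out.println("64 threads: " + queuedDiscards(64, 1000));
    }
}
```

With 1000 pending discards, 4 threads leave 996 tasks queued and 64 threads
leave 936. The larger pool drains the queue faster once storage responds, but
as long as new discards arrive faster than they complete, the queue still
grows without bound.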
> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
> Reporter: Till Rohrmann
> Priority: Critical
> Fix For: 1.9.3, 1.10.1, 1.11.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as
> CPU cores. This change might have caused the decline in completed checkpoint
> discard throughput. This suspicion needs to be validated before trying to fix
> it!
> [1]
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E
--
This message was sent by Atlassian Jira
(v8.3.4#803005)