[
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164623#comment-17164623
]
Etienne Chauchot commented on FLINK-17073:
------------------------------------------
Hi [~roman_khachatryan],
thanks for the suggestions! Overall this is what I intended to do modulo these
minor things:
In 1. I meant to use the _CompletedCheckpoints_ deque to keep track of the
checkpoints to clean and avoid adding a new queue.
In 2. Yes indeed, it needs to call _executeQueuedRequest_, but it must also
not call it while the previous checkpoint's cleaning is still in progress (I
still need to figure out how to sync cleaning with the work in
CheckpointCoordinator), so that checkpoint cleaning becomes part of the
checkpoint process and not a side fire-and-forget process. This behavior will
be configurable to avoid lowering the checkpoint rate when the CP cleaning rate
is not a problem. Once cleaning is part of the standard checkpointing process,
checking the in-flight checkpoints will tell how many checkpoints are
potentially awaiting cleaning and, if there are too many, drop any new CP
trigger request.
In 3. Yes, that is what I meant by "drop any new CP trigger request" above.
In 4. I'm not clear yet about concurrency in checkpointing.
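To make point 2 more concrete, here is a minimal sketch of the idea of counting
pending cleanups and dropping trigger requests once a configurable limit is
reached. It is only an illustration under assumptions: the class
CleanupAwareTrigger, its methods and the maxPendingCleanups option are
hypothetical names, not existing Flink APIs, and the real change would have to
live in (or be synced with) CheckpointCoordinator.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch, NOT Flink code: tracks how many checkpoint cleanups
 * are still running and refuses new trigger requests above a limit.
 */
final class CleanupAwareTrigger {

    // Hypothetical config option; Integer.MAX_VALUE effectively disables throttling.
    private final int maxPendingCleanups;

    // Number of checkpoints handed over for cleaning that are not yet cleaned.
    private final AtomicInteger pendingCleanups = new AtomicInteger();

    CleanupAwareTrigger(int maxPendingCleanups) {
        if (maxPendingCleanups < 1) {
            throw new IllegalArgumentException("maxPendingCleanups must be >= 1");
        }
        this.maxPendingCleanups = maxPendingCleanups;
    }

    /** To be called when a checkpoint is handed over for cleaning. */
    void cleanupStarted() {
        pendingCleanups.incrementAndGet();
    }

    /** To be called by the cleanup task once the checkpoint is discarded. */
    void cleanupFinished() {
        pendingCleanups.decrementAndGet();
    }

    /**
     * Returns true if a new checkpoint trigger request may proceed,
     * false if it should be dropped because too many cleanups are pending.
     */
    boolean tryTrigger() {
        return pendingCleanups.get() < maxPendingCleanups;
    }
}
```

With a limit of 2, a trigger request is accepted while fewer than two cleanups
are pending, dropped once two are pending, and accepted again as soon as one of
them finishes.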
> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
> Key: FLINK-17073
> URL: https://issues.apache.org/jira/browse/FLINK-17073
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
> Reporter: Till Rohrmann
> Assignee: Etienne Chauchot
> Priority: Major
> Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as
> CPU cores. This change might have caused the decline in completed checkpoint
> discard throughput. This suspicion needs to be validated before trying to fix
> it!
> [1]
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E