[ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164623#comment-17164623
 ] 

Etienne Chauchot commented on FLINK-17073:
------------------------------------------

Hi [~roman_khachatryan],

Thanks for the suggestions! Overall, this is what I intended to do, modulo these 
minor points:

In 1. I meant to reuse the _CompletedCheckpoints_ deque to keep track of the 
checkpoints to clean, and so avoid adding a new queue.

In 2. Yes indeed, it needs to call _executeQueuedRequest_, but it must also 
not call it while the previous checkpoint's cleaning is still in progress (I 
still need to figure out how to sync cleaning with the work in 
CheckpointCoordinator), so that checkpoint cleaning becomes part of the 
checkpoint process rather than a fire-and-forget side process. This behavior 
will be configurable, to avoid lowering the checkpoint rate when the CP 
cleaning rate is not a problem. Once cleaning is part of the standard 
checkpointing process, checking the in-flight checkpoints will tell how many 
checkpoints are potentially still being cleaned and, if there are too many, 
any new CP trigger request will be dropped.
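
To illustrate point 2, here is a minimal sketch of the throttling idea. It is 
not the actual CheckpointCoordinator code: the class CleanupAwareTrigger, its 
methods, and the maxPendingCleanups limit are hypothetical names chosen for 
this example, and how cleaning is really synchronized with the coordinator is 
still open.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: drop checkpoint triggers while too many cleanups are pending. */
final class CleanupAwareTrigger {
    private final int maxPendingCleanups;     // assumed configuration value
    private final boolean throttlingEnabled;  // lets users opt out of the behavior
    private final AtomicInteger pendingCleanups = new AtomicInteger();

    CleanupAwareTrigger(int maxPendingCleanups, boolean throttlingEnabled) {
        this.maxPendingCleanups = maxPendingCleanups;
        this.throttlingEnabled = throttlingEnabled;
    }

    /** Called when a subsumed checkpoint is handed over for cleaning. */
    void cleanupStarted() {
        pendingCleanups.incrementAndGet();
    }

    /** Called when the cleaning of one checkpoint has completed. */
    void cleanupFinished() {
        pendingCleanups.decrementAndGet();
    }

    /** Returns true if a new trigger request may proceed, false if it should be dropped. */
    boolean tryTrigger() {
        return !throttlingEnabled || pendingCleanups.get() < maxPendingCleanups;
    }
}
```

With throttling disabled, tryTrigger() always returns true, which corresponds 
to the current fire-and-forget behavior.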

In 3. Yes, that is what I meant by "drop any new CP trigger request" above.

In 4. I'm not clear yet about concurrency in checkpointing.

> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
>                 Key: FLINK-17073
>                 URL: https://issues.apache.org/jira/browse/FLINK-17073
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Assignee: Etienne Chauchot
>            Priority: Major
>             Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E
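
For reference, the two executor configurations described in the issue can be 
approximated with plain JDK classes as follows. This is only a sketch: 
new ForkJoinPool(64) stands in for the former AkkaRpcService pool, and the 
class and method names are made up for illustration. Flink's actual wiring of 
the ioExecutor may differ.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ThreadPoolExecutor;

/** Hypothetical comparison of the two pool shapes mentioned in the issue. */
final class ExecutorComparison {

    /** Approximates the old setup: a ForkJoinPool with a parallelism of 64. */
    static ForkJoinPool oldStylePool() {
        return new ForkJoinPool(64);
    }

    /** Approximates the new setup: a fixed pool with one thread per CPU core. */
    static ThreadPoolExecutor newStylePool() {
        int cores = Runtime.getRuntime().availableProcessors();
        // Executors.newFixedThreadPool is backed by a ThreadPoolExecutor.
        return (ThreadPoolExecutor) Executors.newFixedThreadPool(cores);
    }
}
```

On a machine with fewer than 64 cores, the second pool runs fewer discard 
tasks concurrently, which is the suspected cause that still needs validating.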



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
