[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

Biao Liu (Jira) Mon, 03 Aug 2020 05:55:37 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170015#comment-17170015
 ]


Biao Liu commented on FLINK-17073:
----------------------------------

[~echauchot], sorry for the late reply. Thanks for pushing this!

I'm OK with [~roman_khachatryan]'s plan. It is indeed simpler to implement in 
some respects. With my plan, we would have to work out how to avoid the 
synchronous cleaning you mentioned, because in the near future 
{{CheckpointCoordinator}} will no longer be guarded by one big lock. 

{quote}...we can drop new checkpoint requests when there are too many 
checkpoints to clean...{quote}
I think we should take care of the cleaning of both successful and failed 
checkpoints. 
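
To make the idea concrete, here is a minimal sketch of such a throttle. It is 
only an illustration under assumptions: {{CleanupThrottle}}, its method names 
and the {{maxPendingCleanups}} limit are hypothetical and are not part of 
Flink or of the design doc. The point is that the same counter is used for 
completed and failed checkpoints.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper, not a Flink class: tracks checkpoints awaiting cleanup. */
class CleanupThrottle {
    // Assumed limit on how many cleanups may be outstanding at once.
    private final int maxPendingCleanups;
    private final AtomicInteger pendingCleanups = new AtomicInteger();

    CleanupThrottle(int maxPendingCleanups) {
        this.maxPendingCleanups = maxPendingCleanups;
    }

    /** Call when a checkpoint, completed or failed, is handed off for cleanup. */
    void cleanupScheduled() {
        pendingCleanups.incrementAndGet();
    }

    /** Call when a cleanup task ends, whether it succeeded or threw. */
    void cleanupFinished() {
        pendingCleanups.decrementAndGet();
    }

    /** A new checkpoint request is accepted only while the backlog is below the limit. */
    boolean shouldTriggerCheckpoint() {
        return pendingCleanups.get() < maxPendingCleanups;
    }
}
```

In this sketch a dropped request is simply not triggered; whether it should be 
retried later is left open.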

I have left some comments in the doc.

> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
>                 Key: FLINK-17073
>                 URL: https://issues.apache.org/jira/browse/FLINK-17073
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Assignee: Etienne Chauchot
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E
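
The queueing effect suspected in the description can be reproduced outside 
Flink with plain {{java.util.concurrent}}. The sketch below is an assumption-
laden illustration, not Flink code: {{CleanupBacklogDemo}} is a hypothetical 
class, and the blocking task only stands in for a slow checkpoint discard. It 
builds a pool the same way {{Executors.newFixedThreadPool}} does (fixed size, 
unbounded {{LinkedBlockingQueue}}) and shows that every task beyond the number 
of threads stays in the queue, and therefore on the heap, until a worker is 
free.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo, not Flink code: backlog growth in a fixed-size pool. */
class CleanupBacklogDemo {

    /** Submits blocking tasks and returns how many are left waiting in the queue. */
    static int queuedAfterSubmitting(int threads, int tasks) {
        // Same configuration Executors.newFixedThreadPool(threads) documents:
        // a fixed number of workers operating off an unbounded queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(Math.min(threads, tasks));
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // stands in for a slow discard call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // every worker is now busy
            return pool.getQueue().size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let all tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 4 workers, 1000 slow tasks: prints 996
        System.out.println(queuedAfterSubmitting(4, 1000));
    }
}
```

With fewer threads the backlog is larger for the same submission rate, which 
is consistent with the suspicion above but does not by itself validate it.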



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
