[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs

Biao Liu (Jira) Mon, 27 Jul 2020 01:01:12 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165518#comment-17165518
 ]


Biao Liu commented on FLINK-17073:
----------------------------------

To [~roman_khachatryan], thanks for the nice suggestions!

{quote}I think an alternative (or complementary) temporary solution is to use a 
bounded queue when creating ioExecutor.{quote}
I'm not a fan of this temporary solution. We would have to decide how to treat a 
caller that launches asynchronous IO operations through {{ioExecutor}} when the 
queue is full: should the submission fail, or should the caller wait until space 
becomes available? I'm afraid it would be no small task to review every place 
that calls {{ioExecutor}}. If we want a temporary solution, maybe we could just 
increase the thread count. 
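
To make the "fail or wait" question concrete, here is a minimal sketch using only the JDK. It is not Flink code: the class and method names are hypothetical, and the pool sizes are arbitrary. It shows that a bounded queue with the default rejection handler makes {{execute}} throw once all threads are busy and the queue is full, so every caller would need to handle that case.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not the actual ioExecutor setup in Flink.
public class BoundedIoExecutorSketch {

    /** Fixed-size pool with a bounded queue and the default AbortPolicy. */
    static ThreadPoolExecutor newBoundedExecutor(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
    }

    /** Blocks until the latch is released, standing in for a slow IO operation. */
    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Submits blocking tasks and returns how many submissions were rejected. */
    static int countRejections(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor executor = newBoundedExecutor(threads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    executor.execute(() -> awaitQuietly(release));
                } catch (RejectedExecutionException e) {
                    // All threads are busy and the queue is full: the caller must cope.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the simulated IO finish
            executor.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 threads + 3 queue slots accept 5 tasks; the remaining 5 are rejected.
        System.out.println("rejected = " + countRejections(2, 3, 10));
    }
}
```

Passing {{new ThreadPoolExecutor.CallerRunsPolicy()}} as the rejection handler would instead run the task in the submitting thread, which slows the submitter down rather than failing it. Whether either behavior is acceptable would still have to be checked at each call site.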

Regarding the long-term solution: Etienne and I have not actually discussed 
many of the implementation details. I just gave some suggestions to make sure 
it's heading in the right direction. It's great to have your detailed 
suggestions; they may help a lot for a contributor who is not familiar with 
this part. I just think we don't have to discuss too many details here. It 
might be better to give the contributor more freedom. We can pay closer 
attention during code review to make sure the change is correct and reasonable.

BTW, just a small suggestion: code refactoring is not necessary right now; we 
should focus on solving the issue first. After that, we can consider whether 
some refactoring would make the code more readable or elegant. 

To [~echauchot], besides the implementation, do you have any questions about 
the plan? Please feel free to ask about anything that is unclear. 

> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
>                 Key: FLINK-17073
>                 URL: https://issues.apache.org/jira/browse/FLINK-17073
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Assignee: Etienne Chauchot
>            Priority: Major
>             Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the 
> {{AkkaRpcService}} thread pool which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E
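
The mechanism suspected in the description above can be sketched with plain JDK classes. This is an illustration under assumptions, not a reproduction of the reported job: the class name, the thread counts (4 versus 64) and the task count are made up. A fixed-size pool backed by an unbounded queue, which is what {{Executors.newFixedThreadPool}} creates, accepts every submission, so slow tasks pile up in the queue instead of being rejected.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of queue growth; not taken from the Flink code base.
public class CleanupBacklogSketch {

    /** Blocks until the latch is released, standing in for a slow discard call. */
    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns how many tasks sit in the queue while all submitted tasks are still blocked. */
    static int backlog(int threads, int tasks) {
        // Same shape as Executors.newFixedThreadPool: fixed threads, unbounded queue.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> awaitQuietly(release));
            }
            return executor.getQueue().size();
        } finally {
            release.countDown(); // let the simulated cleanup finish
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("backlog with 4 threads:  " + backlog(4, 100));
        System.out.println("backlog with 64 threads: " + backlog(64, 100));
    }
}
```

With fewer threads, more of the pending tasks stay queued and keep their payload reachable on the heap. More threads shrink the backlog only as long as the backing storage can serve the extra concurrent requests.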



