[jira] [Resolved] (FLINK-2356) Resource leak in checkpoint coordinator

Ufuk Celebi (JIRA) Wed, 26 Aug 2015 11:00:12 -0700

     [ 
https://issues.apache.org/jira/browse/FLINK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ufuk Celebi resolved FLINK-2356.
--------------------------------
    Resolution: Fixed

Fixed via 366d937 (master), 3cdbb80 (release-0.9).

> Resource leak in checkpoint coordinator
> ---------------------------------------
>
>                 Key: FLINK-2356
>                 URL: https://issues.apache.org/jira/browse/FLINK-2356
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, Streaming
>    Affects Versions: 0.9, master
>            Reporter: Ufuk Celebi
>             Fix For: 0.10, 0.9.1
>
>
> The shutdown method of the checkpoint coordinator is not called when a Flink 
> cluster is shut down via SIGINT. The checkpoint coordinator's 
> shutdown/cleanup is only called after the job enters a final state, which does 
> not happen on a regular cluster shutdown (via kill). Because we don't have 
> proper stopping of streaming jobs, every program using checkpointing is 
> affected.
> I've tested this only locally for now with a custom WordCount that checkpoints 
> the current count. After stopping the process, the checkpoint files still 
> exist. Since this is the same mechanism as in a distributed setup with HDFS, 
> files in HDFS should be left lingering as well.
> The problem is that the postStop method of the JM actor is not called when 
> shutting down. The task manager components that need to do resource cleanup 
> register custom shutdown hooks and don't rely on a shutdown call from the 
> task manager.
> For 0.9.1 we need to make sure that the state is simply cleaned up with a 
> shutdown hook (as in the blob manager). For 0.10 with HA we need to be more 
> careful and not clean it up when other job manager instances need access. See 
> FLINK-2354 for details.
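
For illustration only, the shutdown-hook approach proposed for 0.9.1 could be
sketched roughly as below. This is not the code from the fixing commits. The
class name CheckpointStateCleanup, the stateDir parameter, and the use of a
local directory are all assumptions made for the example. It also ignores the
HA case, where state must not be removed while other job manager instances
still need it. JVM shutdown hooks run on SIGINT and SIGTERM, but not on
SIGKILL.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

/** Hypothetical sketch: cleans up a checkpoint directory on JVM shutdown. */
final class CheckpointStateCleanup {

    private final Path stateDir;
    private final AtomicBoolean shutDown = new AtomicBoolean(false);
    private final Thread shutdownHook;

    CheckpointStateCleanup(Path stateDir) {
        this.stateDir = stateDir;
        // Register the hook so cleanup also runs when the regular
        // shutdown path (e.g. an actor's postStop) is never invoked.
        this.shutdownHook = new Thread(this::shutdown, "checkpoint-cleanup-hook");
        Runtime.getRuntime().addShutdownHook(shutdownHook);
    }

    /** Idempotent: safe to call from both the regular path and the hook. */
    void shutdown() {
        if (!shutDown.compareAndSet(false, true)) {
            return; // cleanup already ran
        }
        deleteRecursively(stateDir);

        // On a regular shutdown, unregister the hook so it does not pin
        // this object until the JVM exits.
        if (Thread.currentThread() != shutdownHook) {
            try {
                Runtime.getRuntime().removeShutdownHook(shutdownHook);
            } catch (IllegalStateException e) {
                // JVM is already shutting down; the hook cannot be removed.
            }
        }
    }

    private static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order deletes children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    System.err.println("Could not delete " + p + ": " + e);
                }
            });
        } catch (IOException | UncheckedIOException e) {
            System.err.println("Could not clean up " + root + ": " + e);
        }
    }
}
```

The compareAndSet guard means the cleanup runs at most once, whether it is
triggered by an explicit shutdown() call or by the hook during JVM exit.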



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
