[jira] [Commented] (FLINK-2356) Resource leak in checkpoint coordinator

ASF GitHub Bot (JIRA) Wed, 26 Aug 2015 09:46:47 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14714501#comment-14714501
 ]


ASF GitHub Bot commented on FLINK-2356:
---------------------------------------

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/1063

    [FLINK-2356] Add shutdown hook to CheckpointCoordinator

    This adds a shutdown hook that shuts down the checkpoint coordinator when
    the JobManager receives a SIGINT.
    
    The implementation is similar to the one we have for other services,
    which clean up via shutdown hooks.
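
    The general pattern can be sketched as follows. This is a minimal
    illustration only, not the code in the pull request: the class name
    CoordinatorSketch and its methods are hypothetical, and only the
    java.lang.Runtime shutdown-hook calls are real JDK API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a service that owns external resources. */
final class CoordinatorSketch {

    private final AtomicBoolean shutdown = new AtomicBoolean(false);
    private final AtomicInteger cleanupCount = new AtomicInteger(0);
    private final Thread shutdownHook;

    CoordinatorSketch() {
        // JVM shutdown hooks run on normal exit and on SIGINT/SIGTERM,
        // but not when the process is killed with SIGKILL.
        this.shutdownHook = new Thread(this::shutdown, "coordinator-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(shutdownHook);
    }

    /** Idempotent: only the first call performs the cleanup. */
    void shutdown() {
        if (!shutdown.compareAndSet(false, true)) {
            return;
        }
        releaseResources();

        // On a regular shutdown, unregister the hook so it does not run again
        // and does not keep this object reachable until JVM exit.
        if (Thread.currentThread() != shutdownHook) {
            try {
                Runtime.getRuntime().removeShutdownHook(shutdownHook);
            } catch (IllegalStateException e) {
                // The JVM is already shutting down; the hook cannot be removed.
            }
        }
    }

    /** Placeholder for the real cleanup, e.g. discarding checkpoint state. */
    private void releaseResources() {
        cleanupCount.incrementAndGet();
    }

    boolean isShutdown() {
        return shutdown.get();
    }

    int cleanupCount() {
        return cleanupCount.get();
    }
}
```

    The compare-and-set guard matters because both the regular shutdown path
    and the hook may call shutdown(); only one of them should release the
    resources.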

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink checkpoint-coord-2356-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1063.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1063
    
----
commit 11acb5a9fd0fc48e0445711a6f2aa18f2aa68d36
Author: Ufuk Celebi <[email protected]>
Date:   2015-08-26T16:03:28Z

    [FLINK-2356] Add shutdown hook to CheckpointCoordinator to prevent
    resource leaks

----


> Resource leak in checkpoint coordinator
> ---------------------------------------
>
>                 Key: FLINK-2356
>                 URL: https://issues.apache.org/jira/browse/FLINK-2356
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, Streaming
>    Affects Versions: 0.9, master
>            Reporter: Ufuk Celebi
>             Fix For: 0.10, 0.9.1
>
>
> The shutdown method of the checkpoint coordinator is not called when a Flink 
> cluster is shut down via SIGINT. The issue is that the checkpoint coordinator 
> shutdown/cleanup is only called after the job enters a final state. This does 
> not happen for regular cluster shutdown (via kill). Because we don't have 
> proper stopping of streaming jobs, every program that uses checkpointing is 
> affected.
> I've tested this only locally for now, with a custom WordCount that 
> checkpoints the current count. After stopping the process, the checkpoint 
> files still exist. Since this is the same mechanism as in a distributed setup 
> with HDFS, files should be left lingering in HDFS as well.
> The problem is that the postStop method of the JM actor is not called when 
> shutting down. The task manager components that need to do resource cleanup 
> register custom shutdown hooks and don't rely on a shutdown call from the 
> task manager.
> For 0.9.1 we need to make sure that the state is simply cleaned up with a 
> shutdown hook (as in the blob manager). For 0.10 with HA we need to be more 
> careful and not clean it up when other job manager instances need access. See 
> FLINK-2354 for details.
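
The distinction drawn in the last paragraph of the issue (unconditional cleanup for 0.9.1, conditional cleanup once other job manager instances may need the state) could look roughly like the sketch below. It is an assumption-laden illustration, not Flink code: the class CheckpointCleanupSketch, the flag sharedWithOtherJobManagers, and the use of a local directory in place of HDFS are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical cleanup helper; a local directory stands in for HDFS. */
final class CheckpointCleanupSketch {

    private final Path stateDir;
    // Assumption: true when other job manager instances may still need the state.
    private final boolean sharedWithOtherJobManagers;

    CheckpointCleanupSketch(Path stateDir, boolean sharedWithOtherJobManagers) {
        this.stateDir = stateDir;
        this.sharedWithOtherJobManagers = sharedWithOtherJobManagers;
    }

    /** Deletes the state directory unless it is shared. Returns true if deleted. */
    boolean cleanUp() throws IOException {
        if (sharedWithOtherJobManagers || !Files.exists(stateDir)) {
            return false;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(stateDir)) {
            // Reverse order puts children before their parent directories.
            deepestFirst = paths.sorted(Comparator.reverseOrder())
                                .collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
        return true;
    }

    /** Registers cleanUp() as a JVM shutdown hook and returns the hook thread. */
    Thread registerShutdownHook() {
        Thread hook = new Thread(() -> {
            try {
                cleanUp();
            } catch (IOException e) {
                // Nothing sensible can be done this late; report and move on.
                System.err.println("Checkpoint cleanup failed: " + e);
            }
        }, "checkpoint-cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

In this sketch the decision whether to delete is made inside cleanUp(), so the same hook can be registered in both setups and only the flag differs.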



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
