[jira] [Commented] (FLINK-7783) Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()

ASF GitHub Bot (JIRA) Thu, 19 Oct 2017 23:07:08 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212228#comment-16212228
 ]


ASF GitHub Bot commented on FLINK-7783:
---------------------------------------

Github user StefanRRichter commented on the issue:

    https://github.com/apache/flink/pull/4863
  
    @aljoscha off the top of my head, I think some of the concerns are invalid. 
After being added to the checkpoint store, the incremental handles should be 
complete and self-contained: all placeholder handles have been replaced by the 
real handles. What is, in fact, built up incrementally is the reference count 
for each sst-file handle when we re-register on recovery. So if we do not 
register a handle, it does not contribute to the reference counts, which might 
be correct if we assume the handle is "dead". However, we should not call 
`discardState()` on an unregistered handle, because an unregistered handle is 
assumed to still own its handles and will wipe all of them without 
considering other references in the registry. So for the most part, we can 
simply rely on the remaining handles building up a reasonable count, except 
for handles that occurred only in the broken incremental checkpoints. Those 
leak because we have lost any way to delete them properly. But I think this 
kind of leak has always existed in the implementation, because we cannot call 
dispose if we cannot retrieve the checkpoint.
    
    To sum up: it should work, there is a leak to be aware of, but that problem 
always existed.
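
The reference-counting argument above can be illustrated with a small, self-contained Java sketch. It is not Flink's actual `SharedStateRegistry`; the class `RefCountSketch`, its methods, and the file names `sst-1` to `sst-4` are hypothetical and exist only to show the two effects described: files referenced solely by an unretrievable checkpoint are never counted and therefore leak, and discarding an unregistered handle would delete files that registered checkpoints still reference.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of shared sst-file reference counting; not Flink's API. */
class RefCountSketch {

    /** Files currently present on the simulated DFS. */
    final Set<String> dfs = new TreeSet<>();

    /** Reference count per sst file, rebuilt from scratch on recovery. */
    final Map<String, Integer> refCounts = new HashMap<>();

    /** Re-registering a recovered checkpoint bumps the count of every file it references. */
    void register(List<String> sstFiles) {
        for (String f : sstFiles) {
            refCounts.merge(f, 1, Integer::sum);
        }
    }

    /** Discard of a registered checkpoint: a file is deleted only when its last reference goes. */
    void discardRegistered(List<String> sstFiles) {
        for (String f : sstFiles) {
            Integer count = refCounts.get(f);
            if (count == null) {
                continue; // never registered, nothing to release
            }
            if (count == 1) {
                refCounts.remove(f);
                dfs.remove(f);
            } else {
                refCounts.put(f, count - 1);
            }
        }
    }

    /** Discard of an unregistered checkpoint: it assumes ownership and deletes everything. */
    void discardUnregistered(List<String> sstFiles) {
        dfs.removeAll(sstFiles);
    }

    /** Files on the DFS that no registered checkpoint references any more. */
    Set<String> leaked() {
        Set<String> result = new TreeSet<>(dfs);
        result.removeAll(refCounts.keySet());
        return result;
    }

    /** Recovery where checkpoints A and B are readable and checkpoint C is not. */
    static RefCountSketch recoveredExample() {
        RefCountSketch s = new RefCountSketch();
        s.dfs.addAll(Arrays.asList("sst-1", "sst-2", "sst-3", "sst-4"));
        s.register(Arrays.asList("sst-1", "sst-2")); // checkpoint A
        s.register(Arrays.asList("sst-2", "sst-3")); // checkpoint B
        // Checkpoint C (sst-3, sst-4) could not be retrieved, so it is never registered.
        return s;
    }

    public static void main(String[] args) {
        RefCountSketch s = recoveredExample();
        System.out.println("leaked: " + s.leaked()); // prints "leaked: [sst-4]"

        // Calling the unregistered discard path for C would also remove sst-3,
        // although checkpoint B still holds a reference to it.
        s.discardUnregistered(Arrays.asList("sst-3", "sst-4"));
        System.out.println("sst-3 still on DFS: " + s.dfs.contains("sst-3"));
    }
}
```

In this model the only cost of skipping the broken checkpoint is the orphaned `sst-4`, whereas the unregistered discard path breaks checkpoint B.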


> Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-7783
>                 URL: https://issues.apache.org/jira/browse/FLINK-7783
>             Project: Flink
>          Issue Type: Sub-task
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.0, 1.3.2
>            Reporter: Aljoscha Krettek
>            Assignee: Aljoscha Krettek
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.3
>
>
> Currently, we always delete checkpoint handles if they (or the data from the 
> DFS) cannot be read: 
> https://github.com/apache/flink/blob/91a4b276171afb760bfff9ccf30593e648e91dfb/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L180
> This can lead to problems in case the DFS is temporarily not available, i.e. 
> we could inadvertently
> delete all checkpoints even though they are still valid.
> A user reported this problem on the mailing list: 
> https://lists.apache.org/thread.html/9dc9b719cf8449067ad01114fedb75d1beac7b4dff171acdcc24903d@%3Cuser.flink.apache.org%3E
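
The problem described in the issue can be sketched as a recovery loop that skips unreadable checkpoint handles without removing them from the store. This is a minimal illustration under stated assumptions, not the actual `ZooKeeperCompletedCheckpointStore#recover()` code or the fix in the pull request: the `Handle` interface, the `recover` method, and the use of a plain `Map` in place of ZooKeeper are all hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical recovery loop; a plain Map stands in for the ZooKeeper-backed store. */
class RecoverSketch {

    /** Stand-in for a state handle whose data lives on the DFS. */
    interface Handle {
        String retrieve() throws IOException;
    }

    /**
     * Returns the checkpoints that could be read. Unreadable entries are skipped
     * but deliberately left in the store, so a temporary DFS outage does not
     * cause still-valid checkpoints to be deleted.
     */
    static List<String> recover(Map<String, Handle> store) {
        List<String> recovered = new ArrayList<>();
        for (Map.Entry<String, Handle> entry : store.entrySet()) {
            try {
                recovered.add(entry.getValue().retrieve());
            } catch (IOException e) {
                // Assumption: the failure may be transient, so the entry is kept
                // and can be retried on a later recovery attempt.
            }
        }
        return recovered;
    }
}
```

With this shape, a later call to `recover` sees the same entries again and can succeed once the DFS is reachable.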



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
