[jira] [Comment Edited] (FLINK-6537) Umbrella issue for fixes to incremental snapshots

Stefan Richter (JIRA) Tue, 23 May 2017 01:34:54 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020789#comment-16020789
 ]


Stefan Richter edited comment on FLINK-6537 at 5/23/17 8:33 AM:
----------------------------------------------------------------

I had a look at those logs, but they seem strange to me. From the exceptions, 
the root cause seems to be that the {{SafetyNetWrapperFileSystem}} has a 
registry that is already closed, which can only happen in two places in 
{{Task}}, one correlated with checkpointing in line 834. We are using a thread 
local variable to hold the registry, but this variable is re-initialized with a 
fresh registry for each checkpoint runnable. Since we are not using 
{{InheritableThreadLocal}}, there should be no leaking to other threads, e.g. 
through thread pools. From the log, I also cannot see the precondition failure 
that would indicate that another registry is already in place. So right now, 
the only explanation I see is that an old, closed registry is leaked somehow, 
but I cannot see how this is possible. For debugging purposes, we could print 
the references to the created, closed, and used registries that wrap the 
streams to check if an old registry somehow survives where it should not.
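
To illustrate the thread-local argument, here is a minimal, self-contained 
sketch. It is not Flink code: {{Registry}}, {{PLAIN}}, {{INHERITABLE}} and 
{{seenByChild}} are hypothetical names made up for this example. It shows that 
a value held in a plain {{ThreadLocal}} is invisible to a newly started thread, 
while an {{InheritableThreadLocal}} would hand the same (possibly closed) 
instance to the child. The {{toString}} prints the identity hash, which is the 
kind of reference logging suggested above.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ThreadLocalLeakSketch {

    /** Hypothetical stand-in for a per-checkpoint registry; not Flink's class. */
    static final class Registry {
        private volatile boolean closed;

        void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public String toString() {
            // The identity hash makes it possible to tell instances apart in a log.
            return "Registry@" + Integer.toHexString(System.identityHashCode(this))
                    + (closed ? "[closed]" : "[open]");
        }
    }

    static final ThreadLocal<Registry> PLAIN = new ThreadLocal<>();
    static final ThreadLocal<Registry> INHERITABLE = new InheritableThreadLocal<>();

    /** Returns what a freshly started child thread reads from the given thread local. */
    static Registry seenByChild(ThreadLocal<Registry> local) {
        AtomicReference<Registry> seen = new AtomicReference<>();
        Thread child = new Thread(() -> seen.set(local.get()));
        child.start();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        PLAIN.set(registry);
        INHERITABLE.set(registry);
        registry.close();

        // The plain thread local does not leak to the child thread.
        System.out.println("plain:       " + seenByChild(PLAIN));
        // The inheritable one hands the closed registry to the child thread.
        System.out.println("inheritable: " + seenByChild(INHERITABLE));
    }
}
```

With a plain {{ThreadLocal}}, the child thread reads {{null}}, so a leak through 
thread pools would need some other path, for example an object that holds a 
direct reference to the registry.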

I also tried to reproduce the problem, but unfortunately could never see this 
in any test runs. Is there something very special about your setup or are you 
using a customized version of Flink? For a test run, you could also deactivate 
the safety net in {{FileSystem}} line 389 and check whether this solves the 
problem. I still wonder how this can actually cause trouble, in particular as 
this is already a Flink 1.2 feature :(

Edit: This should actually also cause savepoints with all other backends to 
fail?
Edit 2: Ok, one way for this to happen is that the FileSystem is cached 
somewhere where it shouldn't be (in particular: without previously unwrapping 
it).
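
A rough sketch of how such a cached wrapper could produce a "registry already 
closed" error. This is only a model under assumptions, not the actual Flink 
classes: {{Registry}}, {{Fs}}, {{RawFs}}, {{SafetyNetFs}}, {{CACHE}} and 
{{runCheckpoint}} are all hypothetical. The wrapper is bound to the registry 
that existed when it was created, so caching the wrapper itself keeps the old 
registry alive after it has been closed, while caching the unwrapped file 
system and re-wrapping it per checkpoint does not.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CachedWrapperSketch {

    /** Hypothetical registry that rejects new streams once it is closed. */
    static final class Registry {
        private boolean closed;

        void register(String stream) throws IOException {
            if (closed) {
                throw new IOException("Registry already closed, cannot register " + stream);
            }
        }

        void close() {
            closed = true;
        }
    }

    /** Hypothetical, minimal file system interface. */
    interface Fs {
        String open(String path) throws IOException;
    }

    /** The unwrapped file system; it holds no registry and is safe to cache. */
    static final class RawFs implements Fs {
        @Override
        public String open(String path) {
            return "stream:" + path;
        }
    }

    /** Wrapper that is bound to exactly one registry at construction time. */
    static final class SafetyNetFs implements Fs {
        private final Fs delegate;
        private final Registry registry;

        SafetyNetFs(Fs delegate, Registry registry) {
            this.delegate = delegate;
            this.registry = registry;
        }

        @Override
        public String open(String path) throws IOException {
            String stream = delegate.open(path);
            registry.register(stream); // fails if the bound registry is closed
            return stream;
        }

        Fs unwrap() {
            return delegate;
        }
    }

    /** Stand-in for "cached somewhere"; survives across checkpoints. */
    static final Map<String, Fs> CACHE = new HashMap<>();

    /** Simulates one checkpoint with a fresh registry; returns true on success. */
    static boolean runCheckpoint(String path, boolean cacheUnwrapped) {
        Registry registry = new Registry();
        try {
            Fs fs = CACHE.get("fs");
            if (fs == null) {
                SafetyNetFs wrapped = new SafetyNetFs(new RawFs(), registry);
                CACHE.put("fs", cacheUnwrapped ? wrapped.unwrap() : wrapped);
                fs = wrapped;
            } else if (cacheUnwrapped) {
                // The cache holds the raw file system, so wrap it with the current registry.
                fs = new SafetyNetFs(fs, registry);
            }
            fs.open(path);
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            registry.close();
        }
    }

    public static void main(String[] args) {
        System.out.println("cached wrapper, checkpoint 1 ok:   " + runCheckpoint("chk-1", false));
        System.out.println("cached wrapper, checkpoint 2 ok:   " + runCheckpoint("chk-2", false));
        CACHE.clear();
        System.out.println("cached unwrapped, checkpoint 1 ok: " + runCheckpoint("chk-1", true));
        System.out.println("cached unwrapped, checkpoint 2 ok: " + runCheckpoint("chk-2", true));
    }
}
```

In this model the second checkpoint fails only when the wrapper itself was 
cached, which matches the symptom of a closed registry being used although a 
fresh one was installed for the current checkpoint.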


> Umbrella issue for fixes to incremental snapshots
> -------------------------------------------------
>
>                 Key: FLINK-6537
>                 URL: https://issues.apache.org/jira/browse/FLINK-6537
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.3.0
>            Reporter: Stefan Richter
>            Assignee: Stefan Richter
>             Fix For: 1.3.0
>
>
> This issue tracks ongoing fixes in the incremental checkpointing feature for 
> the 1.3 release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
