[
https://issues.apache.org/jira/browse/FLINK-5663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843182#comment-15843182
]
ASF GitHub Bot commented on FLINK-5663:
---------------------------------------
Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/3228
Concerning the reaper thread:
- It is not really broken in the release branch, it only has a few more
threads than necessary.
- So far we have strived to make sure Flink does not leave any lingering
threads at all (as validated by the MiniCluster thread), because lingering
threads mess up the testing setups of many users who repeatedly execute
programs with the LocalEnvironment. That would be good to keep.
- One can probably stop the thread when all registries are closed and
re-spawn it when new registries come. That would be
In summary: I think lingering threads are a type of regression, actually.
Introducing that for something that is not broken right now is something I
would not do for a release. Especially given that there is probably a cleaner
solution that can both implement the improvement and not have the lingering
threads regression. Let's do that for the 1.2.1/1.3 release.
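The stop-and-re-spawn idea from the third point above could look roughly like the following. This is only a minimal sketch under assumptions, not the Flink implementation: `ReaperLifecycleSketch`, `ReaperManager`, `registryOpened` and `registryClosed` are hypothetical names, and the loop body is a placeholder for whatever cleanup the real reaper thread performs. The thread is started when the first registry opens and is interrupted and joined when the last one closes, so nothing lingers once all registries are gone.

```java
// Hypothetical sketch: none of these names are Flink classes.
class ReaperLifecycleSketch {

    /** Owns one shared reaper thread, reference-counted by the number of open registries. */
    static final class ReaperManager {
        private final Object lock = new Object();
        private int openRegistries = 0;
        private Thread reaper; // non-null only while at least one registry is open

        /** Called when a registry is created; spawns the reaper for the first one. */
        void registryOpened() {
            synchronized (lock) {
                if (openRegistries++ == 0) {
                    reaper = new Thread(ReaperManager::reapLoop, "reaper-sketch");
                    reaper.setDaemon(true);
                    reaper.start();
                }
            }
        }

        /** Called when a registry is closed; stops the reaper after the last one. */
        void registryClosed() {
            Thread toStop = null;
            synchronized (lock) {
                if (openRegistries == 0) {
                    throw new IllegalStateException("no open registry");
                }
                if (--openRegistries == 0) {
                    toStop = reaper;
                    reaper = null;
                }
            }
            if (toStop != null) {
                toStop.interrupt();
                // Wait outside the lock until the thread is really gone.
                boolean interrupted = false;
                while (true) {
                    try {
                        toStop.join();
                        break;
                    } catch (InterruptedException e) {
                        interrupted = true;
                    }
                }
                if (interrupted) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        boolean isReaperRunning() {
            synchronized (lock) {
                return reaper != null && reaper.isAlive();
            }
        }

        /** Placeholder for the real reaping work (assumption: periodic polling). */
        private static void reapLoop() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Thread.sleep(50);
                }
            } catch (InterruptedException e) {
                // Interrupted by registryClosed(): fall through and let the thread end.
            }
        }
    }

    public static void main(String[] args) {
        ReaperManager manager = new ReaperManager();
        manager.registryOpened();
        System.out.println("running after open:  " + manager.isReaperRunning());
        manager.registryClosed();
        System.out.println("running after close: " + manager.isReaperRunning());
    }
}
```

Joining the thread in `registryClosed` is a deliberate choice in this sketch: a test that repeatedly runs programs can then rely on no reaper thread being alive once the last registry has been closed.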
> Checkpoint fails because of closed registry
> -------------------------------------------
>
> Key: FLINK-5663
> URL: https://issues.apache.org/jira/browse/FLINK-5663
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Reporter: Ufuk Celebi
> Assignee: Stefan Richter
>
> While testing the 1.2.0 release I got the following Exception:
> {code}
> 2017-01-26 17:29:20,602 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source (3/8) (2dbce778c4e53a39dec3558e868ceef4) switched from RUNNING to FAILED.
> java.lang.Exception: Error while triggering checkpoint 2 for Source: Custom Source (3/8)
>     at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1117)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Could not perform checkpoint 2 for operator Source: Custom Source (3/8).
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:533)
>     at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1108)
>     ... 5 more
> Caused by: java.lang.Exception: Could not complete snapshot 2 for operator Source: Custom Source (3/8).
>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:372)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1116)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1052)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:640)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:585)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:528)
>     ... 6 more
> Caused by: java.io.IOException: Could not flush and close the file system output stream to file:/Users/uce/Desktop/1-2-testing/fs/83889867a493a1dc80f6c588c071b679/chk-2/e4415d0d-719c-48df-91a9-3171ba468152 in order to obtain the stream state handle
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:333)
>     at org.apache.flink.runtime.state.DefaultOperatorStateBackend.snapshot(DefaultOperatorStateBackend.java:200)
>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:357)
>     ... 11 more
> Caused by: java.io.IOException: Could not open output stream for state backend
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.createStream(FsCheckpointStreamFactory.java:368)
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.flush(FsCheckpointStreamFactory.java:225)
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:305)
>     ... 13 more
> Caused by: java.io.IOException: Cannot register Closeable, registry is already closed. Closing argument.
>     at org.apache.flink.util.AbstractCloseableRegistry.registerClosable(AbstractCloseableRegistry.java:63)
>     at org.apache.flink.core.fs.ClosingFSDataOutputStream.wrapSafe(ClosingFSDataOutputStream.java:99)
>     at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.create(SafetyNetWrapperFileSystem.java:123)
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.createStream(FsCheckpointStreamFactory.java:359)
>     ... 15 more
> {code}
> The job recovered and kept running.
> [[email protected]] Can this be a race with the closable registry?