[ https://issues.apache.org/jira/browse/FLINK-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140602#comment-15140602 ]
ASF GitHub Bot commented on FLINK-3260: --------------------------------------- Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/1613#discussion_r52438500 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java --- @@ -107,7 +108,7 @@ private static final AtomicReferenceFieldUpdater<Execution, ExecutionState> STATE_UPDATER = AtomicReferenceFieldUpdater.newUpdater(Execution.class, ExecutionState.class, "state"); - private static final Logger LOG = ExecutionGraph.LOG; + private static final Logger LOG = LoggerFactory.getLogger(Execution.class); --- End diff -- I think that is to some extend matter of taste how fine grained you split the loggers. Also, actually, if you create loggers by class or by job. If you don't mind keeping it, I would like to keep it. > ExecutionGraph gets stuck in state FAILING > ------------------------------------------ > > Key: FLINK-3260 > URL: https://issues.apache.org/jira/browse/FLINK-3260 > Project: Flink > Issue Type: Bug > Components: JobManager > Affects Versions: 0.10.1 > Reporter: Stephan Ewen > Assignee: Till Rohrmann > Priority: Blocker > Fix For: 1.0.0 > > > It is a bit of a rare case, but the following can currently happen: > 1. Jobs runs for a while, some tasks are already finished. > 2. Job fails, goes to state failing and restarting. Non-finished tasks fail > or are canceled. > 3. For the finished tasks, ask-futures from certain messages (for example > for releasing intermediate result partitions) can fail (timeout) and cause > the execution to go from FINISHED to FAILED > 4. This triggers the execution graph to go to FAILING without ever going > further into RESTARTING again > 5. The job is stuck > It initially looks like this is mainly an issue for batch jobs (jobs where > tasks do finish, rather than run infinitely). > The log that shows how this manifests: > {code} > -------------------------------------------------------------------------------- > 17:19:19,782 INFO akka.event.slf4j.Slf4jLogger > - Slf4jLogger started > 17:19:19,844 INFO Remoting > - Starting remoting > 17:19:20,065 INFO Remoting > - Remoting started; listening on addresses > :[akka.tcp://flink@127.0.0.1:56722] > 17:19:20,090 INFO org.apache.flink.runtime.blob.BlobServer > - Created BLOB server storage directory > /tmp/blobStore-6766f51a-1c51-4a03-acfb-08c2c29c11f0 > 17:19:20,096 INFO org.apache.flink.runtime.blob.BlobServer > - Started BLOB server at 0.0.0.0:43327 - max concurrent requests: 50 - max > backlog: 1000 > 17:19:20,113 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist > - Started memory archivist akka://flink/user/archive > 17:19:20,115 INFO org.apache.flink.runtime.checkpoint.SavepointStoreFactory > - No savepoint state backend configured. Using job manager savepoint state > backend. > 17:19:20,118 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager at akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:20,123 INFO org.apache.flink.runtime.jobmanager.JobManager > - JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager was granted > leadership with leader session ID None. > 17:19:25,605 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at > testing-worker-linux-docker-e6d6931f-3200-linux-4 > (akka.tcp://flink@172.17.0.253:43702/user/taskmanager) as > f213232054587f296a12140d56f63ed1. Current number of registered hosts is 1. > Current number of alive task slots is 2. > 17:19:26,758 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at > testing-worker-linux-docker-e6d6931f-3200-linux-4 > (akka.tcp://flink@172.17.0.253:43956/user/taskmanager) as > f9e78baa14fb38c69517fb1bcf4f419c. Current number of registered hosts is 2. > Current number of alive task slots is 4. > 17:19:27,064 INFO org.apache.flink.api.java.ExecutionEnvironment > - The job has 0 registered types and 0 default Kryo serializers > 17:19:27,071 INFO org.apache.flink.client.program.Client > - Starting client actor system > 17:19:27,072 INFO org.apache.flink.runtime.client.JobClient > - Starting JobClient actor system > 17:19:27,110 INFO akka.event.slf4j.Slf4jLogger > - Slf4jLogger started > 17:19:27,121 INFO Remoting > - Starting remoting > 17:19:27,143 INFO org.apache.flink.runtime.client.JobClient > - Started JobClient actor system at 127.0.0.1:51198 > 17:19:27,145 INFO Remoting > - Remoting started; listening on addresses > :[akka.tcp://flink@127.0.0.1:51198] > 17:19:27,325 INFO org.apache.flink.runtime.client.JobClientActor > - Disconnect from JobManager null. > 17:19:27,362 INFO org.apache.flink.runtime.client.JobClientActor > - Received job Flink Java Job at Mon Jan 18 17:19:27 UTC 2016 > (fa05fd25993a8742da09cc5023c1e38d). > 17:19:27,362 INFO org.apache.flink.runtime.client.JobClientActor > - Could not submit job Flink Java Job at Mon Jan 18 17:19:27 UTC 2016 > (fa05fd25993a8742da09cc5023c1e38d), because there is no connection to a > JobManager. > 17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor > - Connect to JobManager > Actor[akka.tcp://flink@127.0.0.1:56722/user/jobmanager#-1489998809]. > 17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor > - Connected to new JobManager > akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor > - Sending message to JobManager > akka.tcp://flink@127.0.0.1:56722/user/jobmanager to submit job Flink Java Job > at Mon Jan 18 17:19:27 UTC 2016 (fa05fd25993a8742da09cc5023c1e38d) and wait > for progress > 17:19:27,380 INFO org.apache.flink.runtime.client.JobClientActor > - Upload jar files to job manager > akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:27,380 INFO org.apache.flink.runtime.client.JobClientActor > - Submit job to the job manager > akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:27,453 INFO org.apache.flink.runtime.jobmanager.JobManager > - Submitting job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon > Jan 18 17:19:27 UTC 2016). > 17:19:27,591 INFO org.apache.flink.runtime.jobmanager.JobManager > - Scheduling job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon > Jan 18 17:19:27 UTC 2016). > 17:19:27,592 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) > (c79bf4381462c690f5999f2d1949ab50) switched from CREATED to SCHEDULED > 17:19:27,596 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) > (c79bf4381462c690f5999f2d1949ab50) switched from SCHEDULED to DEPLOYING > 17:19:27,597 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (attempt > #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:27,606 INFO org.apache.flink.runtime.client.JobClientActor > - Job was successfully submitted to the JobManager > akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:27,630 INFO org.apache.flink.runtime.jobmanager.JobManager > - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon > Jan 18 17:19:27 UTC 2016) changed to RUNNING. > 17:19:27,637 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) > (e73af91028cb76f7d3cd887cb6d66755) switched from CREATED to SCHEDULED > 17:19:27,654 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 Job execution switched to status RUNNING. > 17:19:27,655 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to > SCHEDULED > 17:19:27,656 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to > DEPLOYING > 17:19:27,666 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) > (e73af91028cb76f7d3cd887cb6d66755) switched from SCHEDULED to DEPLOYING > 17:19:27,667 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (attempt > #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:27,667 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to > SCHEDULED > 17:19:27,669 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to > DEPLOYING > 17:19:27,681 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) > (807daf978da9dc347dca930822c78f8f) switched from CREATED to SCHEDULED > 17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) > (807daf978da9dc347dca930822c78f8f) switched from SCHEDULED to DEPLOYING > 17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (attempt > #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) > (ba45c37065b67fc8f5005a50d0e88fff) switched from CREATED to SCHEDULED > 17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) > (ba45c37065b67fc8f5005a50d0e88fff) switched from SCHEDULED to DEPLOYING > 17:19:27,685 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (attempt > #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:27,686 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to > SCHEDULED > 17:19:27,687 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to > DEPLOYING > 17:19:27,687 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to > SCHEDULED > 17:19:27,692 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to > DEPLOYING > 17:19:27,833 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) > (ba45c37065b67fc8f5005a50d0e88fff) switched from DEPLOYING to RUNNING > 17:19:27,839 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to > RUNNING > 17:19:27,840 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) > (e73af91028cb76f7d3cd887cb6d66755) switched from DEPLOYING to RUNNING > 17:19:27,852 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to > RUNNING > 17:19:27,896 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) > (c79bf4381462c690f5999f2d1949ab50) switched from DEPLOYING to RUNNING > 17:19:27,898 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) > (807daf978da9dc347dca930822c78f8f) switched from DEPLOYING to RUNNING > 17:19:27,901 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to > RUNNING > 17:19:27,905 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:27 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to > RUNNING > 17:19:28,114 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from CREATED to SCHEDULED > 17:19:28,126 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/4) (6421c8f88b191ea844619a40a523773b) switched from CREATED to SCHEDULED > 17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/4) (6421c8f88b191ea844619a40a523773b) switched from SCHEDULED to DEPLOYING > 17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:28,126 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from CREATED to SCHEDULED > 17:19:28,139 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from SCHEDULED to DEPLOYING > 17:19:28,139 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (2/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:28,117 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (4/4) (c928d19f73d700e80cdfad650689febb) switched from CREATED to SCHEDULED > 17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from SCHEDULED to DEPLOYING > 17:19:28,140 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (3/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:28,140 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (4/4) (c928d19f73d700e80cdfad650689febb) switched from SCHEDULED to DEPLOYING > 17:19:28,141 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (4/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:28,147 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) > switched to SCHEDULED > 17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) > switched to SCHEDULED > 17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) > switched to DEPLOYING > 17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) > switched to SCHEDULED > 17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) > switched to DEPLOYING > 17:19:28,156 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) > switched to DEPLOYING > 17:19:28,158 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) > switched to SCHEDULED > 17:19:28,165 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) > switched to DEPLOYING > 17:19:28,238 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) > (e73af91028cb76f7d3cd887cb6d66755) switched from RUNNING to FINISHED > 17:19:28,242 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to > FINISHED > 17:19:28,308 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) > (807daf978da9dc347dca930822c78f8f) switched from RUNNING to FINISHED > 17:19:28,315 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) > (c79bf4381462c690f5999f2d1949ab50) switched from RUNNING to FINISHED > 17:19:28,317 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to > FINISHED > 17:19:28,318 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to > FINISHED > 17:19:28,328 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/4) (6421c8f88b191ea844619a40a523773b) switched from DEPLOYING to RUNNING > 17:19:28,336 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) > switched to RUNNING > 17:19:28,338 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from DEPLOYING to RUNNING > 17:19:28,341 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) > switched to RUNNING > 17:19:28,459 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) > (ba45c37065b67fc8f5005a50d0e88fff) switched from RUNNING to FINISHED > 17:19:28,463 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 DataSource (at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) > (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to > FINISHED > 17:19:28,520 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (4/4) (c928d19f73d700e80cdfad650689febb) switched from DEPLOYING to RUNNING > 17:19:28,529 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) > switched to RUNNING > 17:19:28,540 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from DEPLOYING to RUNNING > 17:19:28,545 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) > switched to RUNNING > 17:19:32,384 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at > testing-worker-linux-docker-e6d6931f-3200-linux-4 > (akka.tcp://flink@172.17.0.253:60852/user/taskmanager) as > 5848d44035a164a0302da6c8701ff748. Current number of registered hosts is 3. > Current number of alive task slots is 6. > 17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/1) (d0f8f69f9047c3154b860850955de20f) switched from CREATED to SCHEDULED > 17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/1) (d0f8f69f9047c3154b860850955de20f) switched from SCHEDULED to DEPLOYING > 17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Deploying Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/1) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:32,605 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:32 Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) > switched to SCHEDULED > 17:19:32,605 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:32 Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) > switched to DEPLOYING > 17:19:32,611 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (4/4) (c928d19f73d700e80cdfad650689febb) switched from RUNNING to FINISHED > 17:19:32,614 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) > switched to FINISHED > 17:19:32,717 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/4) (6421c8f88b191ea844619a40a523773b) switched from RUNNING to FINISHED > 17:19:32,719 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) > switched to FINISHED > 17:19:32,724 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/1) (d0f8f69f9047c3154b860850955de20f) switched from DEPLOYING to RUNNING > 17:19:32,726 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:32 Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) > switched to RUNNING > 17:19:32,843 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from RUNNING to FINISHED > 17:19:32,845 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) > switched to FINISHED > 17:19:33,092 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system [akka.tcp://flink@172.17.0.253:43702] has > failed, address is now gated for [5000] ms. Reason is: [Disassociated]. > 17:19:39,111 WARN Remoting > - Tried to associate with unreachable remote address > [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all > messages to this address will be delivered to dead letters. Reason: > Connection refused: /172.17.0.253:43702 > 17:19:39,113 INFO org.apache.flink.runtime.jobmanager.JobManager > - Task manager akka.tcp://flink@172.17.0.253:43702/user/taskmanager > terminated. > 17:19:39,114 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from RUNNING to FAILED > 17:19:39,120 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:39 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) > switched to FAILED > java.lang.Exception: The slot in which the task was executed has been > released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ > testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: > akka.tcp://flink@172.17.0.253:43702/user/taskmanager > at > org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > at > org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > at > org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > at > org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > at > org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at > akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 17:19:39,129 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/1) (d0f8f69f9047c3154b860850955de20f) switched from RUNNING to CANCELING > 17:19:39,132 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - DataSink (collect()) (1/1) (895e1ea552281a665ae390c966cdb3b7) switched > from CREATED to CANCELED > 17:19:39,149 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:39 Job execution switched to status FAILING. > java.lang.Exception: The slot in which the task was executed has been > released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ > testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: > akka.tcp://flink@172.17.0.253:43702/user/taskmanager > at > org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > at > org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > at > org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > at > org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > at > org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at > akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 17:19:39,173 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:39 Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) > switched to CANCELING > 17:19:39,173 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:39 DataSink (collect())(1/1) switched to > CANCELED > 17:19:39,174 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (1/1) (d0f8f69f9047c3154b860850955de20f) switched from CANCELING to FAILED > 17:19:39,177 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:39 Reduce (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) > switched to FAILED > java.lang.Exception: The slot in which the task was executed has been > released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ > testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: > akka.tcp://flink@172.17.0.253:43702/user/taskmanager > at > org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > at > org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > at > org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > at > org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > at > org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at > akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 17:19:39,179 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:39 Job execution switched to status RESTARTING. > 17:19:39,179 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - Delaying retry of job execution for 10000 ms ... > 17:19:39,179 INFO org.apache.flink.runtime.instance.InstanceManager > - Unregistered task manager > akka.tcp://flink@172.17.0.253:43702/user/taskmanager. Number of registered > task managers 2. Number of available slots 4. > 17:19:39,179 INFO org.apache.flink.runtime.jobmanager.JobManager > - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon > Jan 18 17:19:27 UTC 2016) changed to FAILING. > java.lang.Exception: The slot in which the task was executed has been > released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ > testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: > akka.tcp://flink@172.17.0.253:43702/user/taskmanager > at > org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > at > org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > at > org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > at > org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > at > org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at > akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 17:19:39,180 INFO org.apache.flink.runtime.jobmanager.JobManager > - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon > Jan 18 17:19:27 UTC 2016) changed to RESTARTING. > 17:19:42,766 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph > - CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from FINISHED to FAILED > 17:19:42,773 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:42 CHAIN Partition -> Map (Map at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) > -> Combine (Reduce at > testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) > switched to FAILED > java.lang.IllegalStateException: Update task on instance > f213232054587f296a12140d56f63ed1 @ > testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: > akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to: > at > org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915) > at akka.dispatch.OnFailure.internal(Future.scala:228) > at akka.dispatch.OnFailure.internal(Future.scala:227) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on > [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] > after [10000 ms] > at > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > at > scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > at > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > at java.lang.Thread.run(Thread.java:745) > 17:19:42,774 INFO org.apache.flink.runtime.jobmanager.JobManager > - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon > Jan 18 17:19:27 UTC 2016) changed to FAILING. > java.lang.IllegalStateException: Update task on instance > f213232054587f296a12140d56f63ed1 @ > testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: > akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to: > at > org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915) > at akka.dispatch.OnFailure.internal(Future.scala:228) > at akka.dispatch.OnFailure.internal(Future.scala:227) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on > [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] > after [10000 ms] > at > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > at > scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > at > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > at java.lang.Thread.run(Thread.java:745) > 17:19:42,780 INFO org.apache.flink.runtime.client.JobClientActor > - 01/18/2016 17:19:42 Job execution switched to status FAILING. > java.lang.IllegalStateException: Update task on instance > f213232054587f296a12140d56f63ed1 @ > testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: > akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to: > at > org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915) > at akka.dispatch.OnFailure.internal(Future.scala:228) > at akka.dispatch.OnFailure.internal(Future.scala:227) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on > [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] > after [10000 ms] > at > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > at > scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > at > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > at > akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > at java.lang.Thread.run(Thread.java:745) > 17:19:49,152 WARN Remoting > - Tried to associate with unreachable remote address > [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all > messages to this address will be delivered to dead letters. Reason: > Connection refused: /172.17.0.253:43702 > 17:19:59,172 WARN Remoting > - Tried to associate with unreachable remote address > [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all > messages to this address will be delivered to dead letters. Reason: > Connection refused: /172.17.0.253:43702 > 17:20:09,191 WARN Remoting > - Tried to associate with unreachable remote address > [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all > messages to this address will be delivered to dead letters. Reason: > Connection refused: /172.17.0.253:43702 > 17:24:32,423 INFO org.apache.flink.runtime.jobmanager.JobManager > - Stopping JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:24:32,440 ERROR > org.apache.flink.test.recovery.TaskManagerProcessFailureBatchRecoveryITCase > - > -------------------------------------------------------------------------------- > Test > testTaskManagerProcessFailure[0](org.apache.flink.test.recovery.TaskManagerProcessFailureBatchRecoveryITCase) > failed with: > java.lang.AssertionError: The program did not finish in time > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertFalse(Assert.java:64) > at > org.apache.flink.test.recovery.AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure(AbstractTaskManagerProcessFailureRecoveryTest.java:212) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at org.junit.runners.Suite.runChild(Suite.java:127) > at org.junit.runners.Suite.runChild(Suite.java:26) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)