[
https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann reassigned FLINK-9788:
------------------------------------
Assignee: (was: Till Rohrmann)
> ExecutionGraph Inconsistency prevents Job from recovering
> ---------------------------------------------------------
>
> Key: FLINK-9788
> URL: https://issues.apache.org/jira/browse/FLINK-9788
> Project: Flink
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.6.0
> Environment: Rev: 4a06160
> Hadoop 2.8.3
> Reporter: Gary Yao
> Priority: Critical
> Fix For: 1.6.1, 1.7.0
>
> Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
> After killing many TaskManagers in succession, the ExecutionGraph ended up
> in an inconsistent state, which prevented job recovery.
> The following stack trace was written to the JobManager log several hundred
> times per second:
> {noformat}
> 2018-07-08 16:47:18,855 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph
> - Job General purpose test job (37a794195840700b98feb23e99f7ea24)
> switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting
> the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Resetting
> execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for
> new execution.
> 2018-07-08 16:47:18,857 WARN
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Failed to
> restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in
> non-terminal state CREATED
> at
> org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
> at
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
> at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
> at
> org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
> at
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting JobManager log file was 4.7 GB in size. The first 5000 lines
> of the log file are attached.
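For readers unfamiliar with the failing check, the following is a minimal sketch of the kind of guard that produces the exception in the stack trace above. It is not Flink's code: the class `ResetGuardSketch`, the reduced `State` enum, the `Vertex` class and the `attemptRestarts` helper are all hypothetical names introduced for illustration. It only shows that a vertex already in CREATED cannot be reset, and that retrying without changing the vertex state fails identically on every attempt, which is consistent with the repeated log entries.

```java
public class ResetGuardSketch {

    // Hypothetical, reduced set of states; not Flink's full state model.
    enum State {
        CREATED, RUNNING, FINISHED, CANCELED, FAILED;

        // In this sketch, only these three states count as terminal.
        boolean isTerminal() {
            return this == FINISHED || this == CANCELED || this == FAILED;
        }
    }

    // Hypothetical stand-in for an execution vertex.
    static final class Vertex {
        private State state;

        Vertex(State initial) {
            this.state = initial;
        }

        State getState() {
            return state;
        }

        // Mirrors the guard seen in the stack trace: resetting is only
        // allowed from a terminal state.
        void resetForNewExecution() {
            if (!state.isTerminal()) {
                throw new IllegalStateException(
                        "Cannot reset a vertex that is in non-terminal state " + state);
            }
            state = State.CREATED;
        }
    }

    // Retries the reset a fixed number of times and returns how many
    // attempts failed. Nothing changes the vertex state between attempts,
    // so a vertex stuck in CREATED fails every time.
    static int attemptRestarts(Vertex vertex, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                vertex.resetForNewExecution();
            } catch (IllegalStateException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Vertex stuck = new Vertex(State.CREATED);
        System.out.println("failed attempts: " + attemptRestarts(stuck, 3));
    }
}
```

In this sketch a vertex that starts in FAILED resets successfully once and then fails on later attempts, because the first reset moves it to CREATED. How the real ExecutionGraph reached the inconsistent state is not established by this report.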
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)