[ 
https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9788:
---------------------------------
    Fix Version/s: 1.6.0

> ExecutionGraph Inconsistency prevents Job from recovering
> ---------------------------------------------------------
>
>                 Key: FLINK-9788
>                 URL: https://issues.apache.org/jira/browse/FLINK-9788
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.6.0
>         Environment: Rev: 4a06160
> Hadoop 2.8.3
>            Reporter: Gary Yao
>            Priority: Critical
>             Fix For: 1.6.0
>
>         Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
> After killing many TaskManagers in succession, the ExecutionGraph ended up 
> in an inconsistent state, which prevented job recovery. 
> The following stack trace was written to the JobManager log several hundred 
> times per second:
> {noformat}
> -08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph 
>        - Job General purpose test job (37a794195840700b98feb23e99f7ea24) 
> switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting 
> the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Resetting 
> execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for 
> new execution.
> 2018-07-08 16:47:18,857 WARN  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to 
> restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in 
> non-terminal state CREATED
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
>         at 
> org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
>         at 
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting JobManager log file was 4.7 GB in size. The first 5000 lines 
> of the log file are attached. 
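
The failure mode in the stack trace can be illustrated with a minimal, self-contained sketch. It is not Flink's code: the `State` enum, the `Vertex` class, and the `restart` and `countFailedRestarts` helpers are hypothetical stand-ins, and the sketch makes no claim about how the vertex ended up in CREATED. It only assumes that a reset is allowed from a terminal state alone, and shows why a restart that is retried without changing the vertex state fails on every attempt.

```java
import java.util.ArrayList;
import java.util.List;

public class ResetSketch {

    /** Simplified stand-in for task lifecycle states (hypothetical, not Flink's enum). */
    enum State {
        CREATED, RUNNING, FINISHED, CANCELED, FAILED;

        boolean isTerminal() {
            return this == FINISHED || this == CANCELED || this == FAILED;
        }
    }

    /** Simplified stand-in for a vertex that may only be reset from a terminal state. */
    static final class Vertex {
        private State state;

        Vertex(State state) {
            this.state = state;
        }

        State getState() {
            return state;
        }

        void resetForNewExecution() {
            if (!state.isTerminal()) {
                throw new IllegalStateException(
                        "Cannot reset a vertex that is in non-terminal state " + state);
            }
            state = State.CREATED;
        }
    }

    /** Resets the vertices in order; the first refusal aborts the whole restart. */
    static void restart(List<Vertex> vertices) {
        for (Vertex vertex : vertices) {
            vertex.resetForNewExecution();
        }
    }

    /** Retries the restart a fixed number of times and counts the failed attempts. */
    static int countFailedRestarts(List<Vertex> vertices, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                restart(vertices);
            } catch (IllegalStateException e) {
                // Nothing changes the offending vertex, so the next attempt fails too.
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Vertex> vertices = new ArrayList<>();
        vertices.add(new Vertex(State.CREATED)); // non-terminal, as in the reported trace
        vertices.add(new Vertex(State.FAILED));
        System.out.println("failed restarts: " + countFailedRestarts(vertices, 5));
    }
}
```

In this sketch every attempt throws at the first vertex, so the second vertex is never reset and the retry loop makes no progress. That matches the repeated "Failed to restart the job." warnings in the log above.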



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)