[jira] [Commented] (FLINK-3534) Cancelling a running job can lead to restart instead of stopping

ASF GitHub Bot (JIRA) Mon, 29 Feb 2016 09:04:40 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172150#comment-15172150
 ]


ASF GitHub Bot commented on FLINK-3534:
---------------------------------------

Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/1735#issuecomment-190291277
  
    I've fixed the test failures.
    
    @StephanEwen, the changes can result in the following behaviour: if a slot 
is released, some of its executions can end up in cancelled rather than failed 
state. Is that OK? The following happens: The slot's first execution is failed. 
This triggers an execution graph fail, after which all executions are 
cancelled. When the slot fails further executions, these are already cancelling 
and go to state cancelled. Before they would still go state `FAILED`.


> Cancelling a running job can lead to restart instead of stopping
> ----------------------------------------------------------------
>
>                 Key: FLINK-3534
>                 URL: https://issues.apache.org/jira/browse/FLINK-3534
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Robert Metzger
>            Priority: Critical
>
> I just tried cancelling a regularly running job. Instead of the job stopping, 
> it restarted.
> {code}
> 2016-02-29 10:39:28,415 INFO  org.apache.flink.yarn.YarnJobManager            
>               - Trying to cancel job with ID 5c0604694c8469cfbb89daaa990068df.
> 2016-02-29 10:39:28,416 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Out 
> of order data generator -> (Flat Map, Timestamps/Watermarks) (1/1) 
> (e3b05555ab0e373defb925898de9f200) switched from RUNNING to CANCELING
> ....
> 2016-02-29 10:39:28,488 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
> TriggerWindow(TumblingTimeWindows(60000), 
> FoldingStateDescriptor{name=window-contents, 
> defaultValue=(0,9223372036854775807,0), serializer=null}, EventTimeTrigger(), 
> WindowedStream.apply(WindowedStream.java:397)) (19/24) 
> (c1be31b0be596d2521073b2d78ffa60a) switched from CANCELING to CANCELED
> 2016-02-29 10:40:08,468 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Out 
> of order data generator -> (Flat Map, Timestamps/Watermarks) (1/1) 
> (e3b05555ab0e373defb925898de9f200) switched from CANCELING to FAILED
> 2016-02-29 10:40:08,468 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
> TriggerWindow(TumblingTimeWindows(60000), 
> FoldingStateDescriptor{name=window-contents, 
> defaultValue=(0,9223372036854775807,0), serializer=null}, EventTimeTrigger(), 
> WindowedStream.apply(WindowedStream.java:397)) (1/24) 
> (5ad172ec9932b24d5a98377a2c82b0b3) switched from CANCELING to FAILED
> 2016-02-29 10:40:08,472 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
> TriggerWindow(TumblingTimeWindows(60000), 
> FoldingStateDescriptor{name=window-contents, 
> defaultValue=(0,9223372036854775807,0), serializer=null}, EventTimeTrigger(), 
> WindowedStream.apply(WindowedStream.java:397)) (2/24) 
> (5404ca28ac7cf23b67dff30ef2309078) switched from CANCELING to FAILED
> 2016-02-29 10:40:08,473 INFO  org.apache.flink.yarn.YarnJobManager            
>               - Status of job 5c0604694c8469cfbb89daaa990068df (Event 
> counter: {auto.offset.reset=earliest, rocksdb=hdfs:///user/robert/rocksdb, 
> generateInPlace=soTrue, parallelism=24, 
> bootstrap.servers=cdh544-worker-0:9092, topic=eventsGenerator, 
> eventsPerKeyPerGenerator=2, numKeys=1000000000, 
> zookeeper.connect=cdh544-worker-0:2181, timeSliceSize=60000, eventsKerPey=1, 
> genPar=1}) changed to FAILING.
> java.lang.Exception: Task could not be canceled.
>       at 
> org.apache.flink.runtime.executiongraph.Execution$5.onComplete(Execution.java:902)
>       at akka.dispatch.OnComplete.internal(Future.scala:246)
>       at akka.dispatch.OnComplete.internal(Future.scala:244)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
>       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>       at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka.tcp://[email protected]:50119/user/taskmanager#640539146]] 
> after [10000 ms]
>       at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
>       at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>       at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>       at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>       at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
>       at java.lang.Thread.run(Thread.java:745)
> 2016-02-29 10:40:08,477 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
> TriggerWindow(TumblingTimeWindows(60000), 
> FoldingStateDescriptor{name=window-contents, 
> defaultValue=(0,9223372036854775807,0), serializer=null}, EventTimeTrigger(), 
> WindowedStream.apply(WindowedStream.java:397)) (3/24) 
> (fc527d65ec8df3ccf68f882d968e776e) switched from CANCELING to FAILED
> 2016-02-29 10:40:08,487 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
> TriggerWindow(TumblingTimeWindows(60000), 
> FoldingStateDescriptor{name=window-contents, 
> defaultValue=(0,9223372036854775807,0), serializer=null}, EventTimeTrigger(), 
> WindowedStream.apply(WindowedStream.java:397)) (4/24) 
> (afb1aa3c2d8acdee0f138cf344238e4e) switched from CANCELING to FAILED
> 2016-02-29 10:40:08,488 INFO  
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy  - 
> Delaying retry of job execution for 3000 ms ...
> 2016-02-29 10:40:08,488 INFO  org.apache.flink.yarn.YarnJobManager            
>               - Status of job 5c0604694c8469cfbb89daaa990068df (Event 
> counter: {auto.offset.reset=earliest, rocksdb=hdfs:///user/robert/rocksdb, 
> generateInPlace=soTrue, parallelism=24, 
> bootstrap.servers=cdh544-worker-0:9092, topic=eventsGenerator, 
> eventsPerKeyPerGenerator=2, numKeys=1000000000, 
> zookeeper.connect=cdh544-worker-0:2181, timeSliceSize=60000, eventsKerPey=1, 
> genPar=1}) changed to RESTARTING.
> 2016-02-29 10:40:11,490 INFO  org.apache.flink.yarn.YarnJobManager            
>               - Status of job 5c0604694c8469cfbb89daaa990068df (Event 
> counter: {auto.offset.reset=earliest, rocksdb=hdfs:///user/robert/rocksdb, 
> generateInPlace=soTrue, parallelism=24, 
> bootstrap.servers=cdh544-worker-0:9092, topic=eventsGenerator, 
> eventsPerKeyPerGenerator=2, numKeys=1000000000, 
> zookeeper.connect=cdh544-worker-0:2181, timeSliceSize=60000, eventsKerPey=1, 
> genPar=1}) changed to CREATED.
> 2016-02-29 10:40:11,490 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Out 
> of order data generator -> (Flat Map, Timestamps/Watermarks) (1/1) 
> (1319b2f44d78d99948ffde4350c052d9) switched from CREATED to SCHEDULED
> 2016-02-29 10:40:11,490 INFO  org.apache.flink.yarn.YarnJobManager            
>               - Status of job 5c0604694c8469cfbb89daaa990068df (Event 
> counter: {auto.offset.reset=earliest, rocksdb=hdfs:///user/robert/rocksdb, 
> generateInPlace=soTrue, parallelism=24, 
> bootstrap.servers=cdh544-worker-0:9092, topic=eventsGenerator, 
> eventsPerKeyPerGenerator=2, numKeys=1000000000, 
> zookeeper.connect=cdh544-worker-0:2181, timeSliceSize=60000, eventsKerPey=1, 
> genPar=1}) changed to RUNNING.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-3534) Cancelling a running job can lead to restart instead of stopping

Reply via email to