[ https://issues.apache.org/jira/browse/FLINK-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299824#comment-14299824 ]
ASF GitHub Bot commented on FLINK-1376:
---------------------------------------

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/317#issuecomment-72319281

No, I don't think this has anything to do with the shared slot release.
It should be handled in a separate ticket.

On Sat, Jan 31, 2015 at 11:54 AM, Robert Metzger <notificati...@github.com> wrote:

> Hi,
> maybe that's a helpful datapoint for you:
> I have a Flink cluster started where all TaskManagers died
> (misconfiguration). The JobManager needs more than 200 seconds to realize
> that (on the TaskManagers overview, you see timeouts < 200). When
> submitting a job, you'll get the following exception:
>
> org.apache.flink.client.program.ProgramInvocationException: The program execution failed: java.lang.Exception: Failed to deploy the task CHAIN DataSource (Generator: class io.airlift.tpch.NationGenerator) -> Map (Map at writeAsFormattedText(DataSet.java:1132)) (1/1) - execution #0 to slot SubSlot 0 (f8d11026ec5a11f0b273184c74ec4f29 (0) - ALLOCATED/ALIVE): java.lang.NullPointerException
>     at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$submitTask(TaskManager.scala:346)
>     at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:248)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>     at org.apache.flink.yarn.YarnTaskManager$$anonfun$receiveYarnMessages$1.applyOrElse(YarnTaskManager.scala:32)
>     at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:41)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:27)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:27)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>     at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:78)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>     at org.apache.flink.runtime.executiongraph.Execution$2.onComplete(Execution.java:311)
>     at akka.dispatch.OnComplete.internal(Future.scala:247)
>     at akka.dispatch.OnComplete.internal(Future.scala:244)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>     at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>     at org.apache.flink.client.program.Client.run(Client.java:345)
>     at org.apache.flink.client.program.Client.run(Client.java:304)
>     at org.apache.flink.client.program.Client.run(Client.java:298)
>     at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55)
>     at flink.generators.programs.TPCHGenerator.main(TPCHGenerator.java:80)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437)
>     at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
>     at org.apache.flink.client.program.Client.run(Client.java:250)
>     at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:389)
>     at org.apache.flink.client.CliFrontend.run(CliFrontend.java:358)
>     at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1068)
>     at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1092)
>
> I think an NPE is never desired. Does this pull request cover this case?
>
> --
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/flink/pull/317#issuecomment-72313219>.

> SubSlots are not properly released in case that a TaskManager fatally fails,
> leaving the system in a corrupted state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-1376
>                 URL: https://issues.apache.org/jira/browse/FLINK-1376
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> If the TaskManager fails fatally and some of the failing node's
> slots are SharedSlots, the slots are not properly released by the
> JobManager. As a result, the corresponding job is not properly
> failed, leaving the system in a corrupted state.
> The reason is that the AllocatedSlot is not aware of being treated
> as a SharedSlot and thus cannot release the associated SubSlots.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)