[
https://issues.apache.org/jira/browse/SPARK-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tatiana Borisova updated SPARK-3150:
------------------------------------
Description:
The issue occurs when Spark is run standalone on a cluster.
When the master and the driver fail simultaneously on one node of the cluster, the
master tries to recover its state and restart the Spark driver.
While restarting the driver, the master crashes with a NullPointerException
(stack trace below). After crashing, it restarts, tries to recover its state,
and restarts the Spark driver again; this repeats in an infinite cycle.
Specifically, Spark reads the DriverInfo state from ZooKeeper, but after
deserialization DriverInfo.worker turns out to be null.
Stack trace (observed on version 1.0.0, but reproducible on version 1.0.2 as well):
[2014-08-14 21:44:59,519] ERROR (akka.actor.OneForOneStrategy)
java.lang.NullPointerException
	at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
	at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
	at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
	at org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
	at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
How to reproduce: when running Spark standalone on a cluster, kill all Spark
processes on the node where the driver runs (i.e. kill the driver, master, and
worker simultaneously).
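The failure mode above can be sketched in miniature: completeRecovery filters the recovered drivers by their worker reference, and a DriverInfo whose worker field came back null from ZooKeeper deserialization makes the filter predicate throw. This is a hypothetical illustration of the mechanism, not the actual Master.scala code; the class shapes below are simplified stand-ins.

```scala
// Minimal sketch of the NPE mechanism described in this report.
// WorkerInfo and DriverInfo here are simplified ASSUMED stand-ins for
// Spark's internal classes, not their real definitions.
object RecoveryNpeSketch {
  class WorkerInfo(val state: String)

  class DriverInfo {
    // After recovery from ZooKeeper, this reference is null instead of
    // being re-linked to a live worker -- the condition the report describes.
    var worker: WorkerInfo = null
  }

  def main(args: Array[String]): Unit = {
    val drivers = Set(new DriverInfo)

    // Filtering drivers by their worker's state, as completeRecovery does,
    // dereferences the null worker and throws NullPointerException.
    val thrown =
      try { drivers.filter(_.worker.state == "ALIVE"); false }
      catch { case _: NullPointerException => true }

    assert(thrown, "expected NPE from null DriverInfo.worker")
    println("NPE reproduced: filter dereferenced a null worker")
  }
}
```

Because the master restarts after the crash and re-runs the same recovery path, the same null dereference recurs, which is why the cycle never terminates without a null check (or removal of such drivers) in the recovery code.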
> NullPointerException in Spark recovery after simultaneous fall of master and
> driver
> -----------------------------------------------------------------------------------
>
> Key: SPARK-3150
> URL: https://issues.apache.org/jira/browse/SPARK-3150
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2
> Environment: Linux 3.2.0-23-generic x86_64
> Reporter: Tatiana Borisova
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]