[
https://issues.apache.org/jira/browse/SPARK-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687926#comment-16687926
]
liupengcheng commented on SPARK-26074:
--------------------------------------
I will submit a PR soon
> AsyncEventQueue.stop hangs when eventQueue is full
> --------------------------------------------------
>
> Key: SPARK-26074
> URL: https://issues.apache.org/jira/browse/SPARK-26074
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0, 2.4.0
> Reporter: liupengcheng
> Priority: Major
>
> In our production environment, we found a case where the driver hangs when the
> application has finished and is about to exit.
> Details:
> The spark-listener-group-shared thread may exit because of an exception thrown
> by a plugin listener (in our case, the XGBoost TaskFailedListener). Once that
> thread is gone, nothing drains the eventQueue, so it can easily become full
> and stay full. After that, when the app finishes and SparkContext.stop is
> called, the driver hangs at the following stack:
> {code:java}
> "Driver" #36 prio=5 os_prio=0 tid=0x00007fb08a948800 nid=0x6edc waiting on condition [0x00007faff5017000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000006c718d7d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
> at org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:117)
> at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:202)
> at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:202)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:202)
> - locked <0x00000006c052f0a8> (a org.apache.spark.scheduler.LiveListenerBus)
> at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1842)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1294)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1841)
> {code}
> The cause is that AsyncEventQueue.stop puts a POISON_PILL into the queue.
> However, the queue is full and its consumer thread has already exited, so the
> put blocks forever.
> Exit message of the spark-listener-group-shared thread:
> {code:java}
> 2018-11-15,14:44:04,782 INFO org.apache.spark.scheduler.AsyncEventQueue: Stopping listener queue shared.
> java.lang.InterruptedException: ExecutorLost during XGBoost Training: TaskKilled (killed intentionally)
> at org.apache.spark.TaskFailedListener.onTaskEnd(SparkParallelismTracker.scala:116)
> at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:35)
> at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:35)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:83)
> at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:83)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:79)
> at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:75)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1256)
> at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:74)
> {code}
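
To illustrate the hang outside of Spark, here is a minimal Java sketch. It assumes only what the description states: a bounded LinkedBlockingQueue that is full and has no consumer left. The class FullQueueStopSketch, its helper methods, and the stand-in POISON_PILL object are hypothetical names for this example, not Spark code, and the timed offer is just one possible way to avoid the unbounded wait, not necessarily what the eventual PR will do.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class FullQueueStopSketch {
    // Hypothetical stand-in for the sentinel that tells a dispatch thread to exit.
    static final Object POISON_PILL = new Object();

    /**
     * Tries to enqueue the sentinel, waiting at most timeoutMs milliseconds.
     * Returns false if the queue stayed full, instead of parking forever like put().
     */
    static boolean offerPoisonPill(LinkedBlockingQueue<Object> queue, long timeoutMs)
            throws InterruptedException {
        return queue.offer(POISON_PILL, timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Builds a bounded queue filled to capacity, with no consumer draining it. */
    static LinkedBlockingQueue<Object> fullQueue(int capacity) {
        LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.add("event-" + i);
        }
        return queue;
    }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Object> full = fullQueue(2);
        // full.put(POISON_PILL) would park here forever, as in the Driver stack above.
        System.out.println("full queue accepted pill: " + offerPoisonPill(full, 100));

        LinkedBlockingQueue<Object> withRoom = fullQueue(2);
        withRoom.poll(); // simulate a live consumer taking one event
        System.out.println("drained queue accepted pill: " + offerPoisonPill(withRoom, 100));
    }
}
```

With a dead consumer the timed offer gives up and returns false, so the caller can decide how to proceed (for example, skip the pill because no thread is left to receive it), whereas put never returns.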