[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957258#comment-15957258 ]
Artur Sukhenko commented on SPARK-19068:
----------------------------------------

Having a similar problem. To reproduce:
{panel}
[spark-2.1.0-bin-without-hadoop]$ ./bin/run-example --master yarn --deploy-mode client --num-executors 4 SparkPi 1000000
{panel}
Took a jstack of the driver process and found this thread:
{code}
"SparkListenerBus" #10 daemon prio=5 os_prio=0 tid=0x00007fc8fdc8c800 nid=0x37e7 runnable [0x00007fc838764000]
   java.lang.Thread.State: RUNNABLE
        at scala.collection.mutable.HashTable$class.resize(HashTable.scala:262)
        at scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$addEntry0(HashTable.scala:154)
        at scala.collection.mutable.HashTable$class.findOrAddEntry(HashTable.scala:166)
        at scala.collection.mutable.LinkedHashMap.findOrAddEntry(LinkedHashMap.scala:49)
        at scala.collection.mutable.LinkedHashMap.put(LinkedHashMap.scala:71)
        at scala.collection.mutable.LinkedHashMap.$plus$eq(LinkedHashMap.scala:89)
        at scala.collection.mutable.LinkedHashMap.$plus$eq(LinkedHashMap.scala:49)
        at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
        at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.AbstractMap.$plus$plus$eq(Map.scala:80)
        at scala.collection.IterableLike$class.drop(IterableLike.scala:152)
        at scala.collection.AbstractIterable.drop(Iterable.scala:54)
        at org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:412)
        - locked <0x00000000800b9db8> (a org.apache.spark.ui.jobs.JobProgressListener)
        at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
        at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
        at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
        at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
        at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1245)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
{code}
At 100k+ tasks the job becomes very slow and I am getting the following warnings:
{code}
[Stage 0:=====> (108000 + 4) / 1000000]17/04/06 01:58:49 WARN LiveListenerBus: Dropped 182143 SparkListenerEvents since Thu Apr 06 01:57:49 JST 2017
[Stage 0:=====> (109588 + 4) / 1000000]17/04/06 01:59:49 WARN LiveListenerBus: Dropped 196647 SparkListenerEvents since Thu Apr 06 01:58:49 JST 2017
[Stage 0:=====> (111241 + 5) / 1000000]
{code}
After some time we get this:
{code}
rBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray())
[Stage 0:======> (126782 + 4) / 1000000]17/04/06 02:12:28 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
[Stage 0:======> (126919 + 5) / 1000000]17/04/06 02:12:31 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray())
[Stage 0:======> (126982 + 4) / 1000000]17/04/06 02:12:32 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(1,WrappedArray())
[Stage 0:======> (127030 + 5) / 1000000]17/04/06 02:12:33 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray())
[Stage 0:======> (127211 + 4) / 1000000]17/04/06 02:12:38 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
[Stage 0:======> (127326 + 4) / 1000000]17/04/06 02:12:41 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray())
[Stage 0:======> (127374 + 5) / 1000000]17/04/06 02:12:42 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(1,WrappedArray())
[Stage 0:======> (127408 + 4) / 1000000]17/04/06 02:12:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray())
[Stage 0:======> (127581 + 4) / 1000000]17/04/06 02:12:48 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
[Stage 0:======> (127687 + 4) / 1000000]17/04/06 02:12:51 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray())
{code}

> Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped!
> Dropping event SparkListenerExecutorMetricsUpdate(41,WrappedArray())
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19068
>                 URL: https://issues.apache.org/jira/browse/SPARK-19068
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>        Environment: RHEL 7.2
>           Reporter: JESSE CHEN
>        Attachments: sparklog.tar.gz
>
> On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in
> order to use all RAM and cores for a 100TB Spark SQL workload. Long-running
> queries tend to report the following ERRORs:
> {noformat}
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(136,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(853,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(395,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(736,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(439,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(16,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(307,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(51,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(535,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(63,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(333,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(484,WrappedArray())
> ....(omitted)
> {noformat}
> The message itself may be a reasonable response to an already stopped
> SparkListenerBus (subsequent events are thrown away with that ERROR
> message). The issue is that SparkContext does NOT exit until all of these
> ERRORs/events have been reported, which is a huge number in our setup --
> and this can take, in some cases, hours.
> We tried increasing spark.scheduler.listenerbus.eventqueue.size from the
> default 10K ("Adding default property:
> spark.scheduler.listenerbus.eventqueue.size=130000"), but this still occurs.
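The drops described above follow from the listener bus design: events are posted to a bounded queue drained by a single listener thread, and when that thread is too slow (here it is stuck in JobProgressListener.onTaskEnd, per the jstack), posting does not block the scheduler -- new events are simply discarded and counted. The following is a minimal illustrative sketch of that mechanism, not Spark's actual implementation; the class and event names are hypothetical:

```python
from queue import Queue, Full

class ToyListenerBus:
    """Toy model of a LiveListenerBus-style bounded event queue.

    post() never blocks the caller; when the queue is full the event is
    dropped and counted, which is what produces the
    'Dropped N SparkListenerEvents' warnings seen in the logs above.
    """

    def __init__(self, capacity):
        self._queue = Queue(maxsize=capacity)
        self.dropped = 0

    def post(self, event):
        try:
            # Hand the event to the (single) listener thread's queue.
            self._queue.put_nowait(event)
        except Full:
            # Queue is full: the event is lost, not retried.
            self.dropped += 1

    def pending(self):
        return self._queue.qsize()

# With a consumer that cannot keep up (here: absent entirely),
# a burst of events overflows the queue and the excess is dropped.
bus = ToyListenerBus(capacity=3)
for i in range(10):
    bus.post(("SparkListenerTaskEnd", i))
print(bus.pending(), bus.dropped)  # 3 events queued, 7 dropped
```

In real deployments the queue capacity is the spark.scheduler.listenerbus.eventqueue.size setting the reporter mentions (default 10000 in Spark 2.1). Raising it, as tried above with 130000, only buys headroom: if the listener is persistently slower than the event rate, the queue eventually fills and drops resume.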