[
https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803628#comment-15803628
]
JESSE CHEN commented on SPARK-19068:
------------------------------------
Well, although it does not affect the correctness of the results, a query
that seemingly takes only 30 minutes now taking 2.5 hours is a concern to Spark
users. I used the 'spark-sql' shell, so normal users will not know the query
has actually finished until the shell exits. Plus, Spark is hogging resources
(memory and cores) until the SparkContext exits, so this is a usability and trust
issue.
I also think this always occurs at high volume and on a large cluster. As Spark
is being adopted by enterprise users, this issue will be at the forefront.
I do think there is a fundamental timing issue here.
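For context, the dropped-event error appears to come from a guard on a "stopped" flag
in the listener bus: once the bus has been stopped, any event posted afterwards is
logged and discarded instead of delivered. A minimal sketch of that pattern
(illustrative only, not the actual org.apache.spark.scheduler.LiveListenerBus source):
{noformat}
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative stand-in for a listener bus that drops events once stopped.
class SketchListenerBus {
  private val stopped = new AtomicBoolean(false)

  def post(event: AnyRef): Unit = {
    if (stopped.get) {
      // This is the kind of check that produces
      // "SparkListenerBus has already stopped! Dropping event ..."
      println(s"ERROR SparkListenerBus has already stopped! Dropping event $event")
    } else {
      // ...otherwise the event would be queued for the listener thread.
    }
  }

  def stop(): Unit = stopped.set(true)
}
{noformat}
The timing issue would then be that executors keep producing
SparkListenerExecutorMetricsUpdate events after stop() has been called, and with
1000+ executors the driver spends a long time churning through and logging them
before the shell returns.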
> Large number of executors causing a ton of ERROR scheduler.LiveListenerBus:
> SparkListenerBus has already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(41,WrappedArray())
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-19068
> URL: https://issues.apache.org/jira/browse/SPARK-19068
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.1.0
> Environment: RHEL 7.2
> Reporter: JESSE CHEN
> Attachments: sparklog.tar.gz
>
>
> On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in
> order to use all RAM and cores for a 100TB Spark SQL workload. Long-running
> queries tend to report the following ERRORs
> {noformat}
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(136,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(853,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(395,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(736,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(439,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(16,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(307,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(51,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(535,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(63,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(333,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has
> already stopped! Dropping event
> SparkListenerExecutorMetricsUpdate(484,WrappedArray())
> ....(omitted)
> {noformat}
> The message itself may be a reasonable response to an already stopped
> SparkListenerBus (subsequent events are simply dropped with that ERROR
> message). The issue is that the SparkContext does NOT exit until all of
> these ERRORs/events are reported, which is a huge number in our setup -- and
> this can take, in some cases, hours!
> We tried increasing the listener bus event queue size from the default 10K by
> adding the property spark.scheduler.listenerbus.eventqueue.size=130000, but
> this still occurs.
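For reference, the queue size override described above would typically be supplied via
conf/spark-defaults.conf or directly on the command line; the invocation below is only
an illustration of how such a property is passed, not the reporter's actual command:
{noformat}
# conf/spark-defaults.conf -- raise the listener bus queue from the 10000 default
spark.scheduler.listenerbus.eventqueue.size   130000

# or equivalently per invocation
spark-sql --conf spark.scheduler.listenerbus.eventqueue.size=130000 -f query.sql
{noformat}
That this did not help is consistent with the error coming from posting to an
already stopped bus rather than from queue overflow, so a larger queue would not
be expected to change the behavior.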