[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
JESSE CHEN updated SPARK-19068:
-------------------------------
    Attachment: sparklog.tar.gz

This is the Spark console output, in which you can find the settings and the sequence of events. At the end you will see the "never-ending" event-dropping messages.

> Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,WrappedArray())
> ------------------------------------------------------------------------
>
>                 Key: SPARK-19068
>                 URL: https://issues.apache.org/jira/browse/SPARK-19068
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>         Environment: RHEL 7.2
>            Reporter: JESSE CHEN
>         Attachments: sparklog.tar.gz
>
>
> On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in order to use all RAM and cores for a 100TB Spark SQL workload. Long-running queries tend to report the following ERRORs:
> {noformat}
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(136,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(853,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(395,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(736,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(439,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(16,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(307,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(51,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(535,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(63,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(333,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(484,WrappedArray())
> ....(omitted)
> {noformat}
> The message itself may be a reasonable response to an already stopped SparkListenerBus (so subsequent events are thrown away with that ERROR message). The issue is that SparkContext does NOT exit until all of these ERRORs/events have been reported, which is a huge number in our setup, and this can take, in some cases, hours!
> We tried increasing the event queue size from 10K:
> Adding default property: spark.scheduler.listenerbus.eventqueue.size=130000
> but this still occurs.
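For anyone triaging a console log like the attached one, here is a minimal sketch (not part of the original report) that tallies the drop messages per executor, which gives a rough idea of how many ERROR lines are emitted before SparkContext exits. The function name count_dropped_events and the regular expression are assumptions derived only from the message format quoted above; the sample lines are copied from the report.

```python
import re
from collections import Counter

# Pattern based on the ERROR lines quoted in the report. The first argument of
# SparkListenerExecutorMetricsUpdate is treated as the executor ID (kept as a
# string, since it is not guaranteed to be numeric).
DROP_RE = re.compile(
    r"ERROR scheduler\.LiveListenerBus: SparkListenerBus has already stopped! "
    r"Dropping event SparkListenerExecutorMetricsUpdate\(([^,]+),"
)


def count_dropped_events(lines):
    """Return a Counter mapping executor ID -> number of drop messages seen."""
    counts = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two sample lines taken from the report, plus one unrelated line.
sample = [
    "16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has "
    "already stopped! Dropping event "
    "SparkListenerExecutorMetricsUpdate(136,WrappedArray())",
    "16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has "
    "already stopped! Dropping event "
    "SparkListenerExecutorMetricsUpdate(853,WrappedArray())",
    "some unrelated console line",
]

counts = count_dropped_events(sample)
print("total drop messages:", sum(counts.values()))
for executor_id, n in counts.most_common():
    print(executor_id, n)
```

To run it against a real log, pass an open file object instead of the sample list; the function only iterates over lines, so it does not load the whole file into memory.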