[
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228966#comment-14228966
]
Josh Rosen commented on SPARK-4498:
-----------------------------------
Here's an interesting pattern to grep for in all-master-logs-around-blip.txt:
{{[email protected]:52047}}. Note that this log is in
reverse-chronological order.
The earliest occurrence is in a DisassociatedEvent log message:
{code}
2014-11-19_18:48:31.34508 14/11/19 18:48:31 ERROR EndpointWriter:
AssociationError [akka.tcp://[email protected]:7077] <-
[akka.tcp://[email protected]:52047]: Error [Shut down
address: akka.tcp://[email protected]:52047] [
2014-11-19_18:48:31.34510 akka.remote.ShutDownAssociation: Shut down address:
akka.tcp://[email protected]:52047
2014-11-19_18:48:31.34511 Caused by:
akka.remote.transport.Transport$InvalidAssociationException: The remote system
terminated the association because it is shutting down.
2014-11-19_18:48:31.34512 ]
2014-11-19_18:48:31.34521 14/11/19 18:48:31 INFO LocalActorRef: Message
[akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.16.126.88%3A48040-1355#-59270061]
was not delivered. [2859] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
2014-11-19_18:48:31.34603 14/11/19 18:48:31 INFO Master:
akka.tcp://[email protected]:52047 got disassociated,
removing it.
2014-11-19_18:48:31.20255 14/11/19 18:48:31 INFO Master: Removing executor
app-20141119184815-1316/7 because it is EXITED
{code}
Even though INFO-level logging is enabled, there's no "INFO Master: Removing
app ..." message near this event. The entire log contains many repetitions of
this same DisassociatedEvent message.
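For context on why the missing "Removing app" line matters: the Master logs "got disassociated, removing it" unconditionally, but it only actually removes an application when the disassociated address still maps to a registered app. Here's a simplified, self-contained model of that bookkeeping (a paraphrase for illustration only, not the actual {{Master.scala}} source; the map and method names are assumptions and may differ from the real code):
{code}
import scala.collection.mutable

// Simplified, self-contained model of the Master's DisassociatedEvent handling.
// Paraphrased for illustration only, not the actual Master.scala; names here
// are assumptions and may differ from the real code.
object DisassociationModel {
  case class ApplicationInfo(id: String)

  // Master-side bookkeeping: remote actor address -> registered application.
  val addressToApp = mutable.HashMap[String, ApplicationInfo]()

  def finishApplication(app: ApplicationInfo): Unit = {
    // The real Master also frees the app's executors, notifies workers, etc.
    println(s"INFO Master: Removing app ${app.id}")
  }

  // Roughly what happens when Akka remoting reports a DisassociatedEvent.
  def onDisassociated(address: String): Unit = {
    // This line is logged unconditionally...
    println(s"INFO Master: $address got disassociated, removing it.")
    // ...but the application is only removed if the address is still tracked.
    // If the lookup misses, no "Removing app ..." line ever appears, which is
    // consistent with the log excerpt above.
    addressToApp.remove(address).foreach(finishApplication)
  }

  def main(args: Array[String]): Unit = {
    addressToApp("akka.tcp://sparkDriver@driver-a:52047") = ApplicationInfo("app-A")
    onDisassociated("akka.tcp://sparkDriver@driver-a:52047") // both log lines
    onDisassociated("akka.tcp://sparkDriver@driver-b:12345") // only "got disassociated"
  }
}
{code}
If the driver's address is missing from that map when the DisassociatedEvent arrives (or the removal path otherwise short-circuits), you would see exactly this pattern: repeated disassociation messages with no "Removing app" line.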
The same log also contains many executors that launch and immediately fail:
{code}
2014-11-19_18:52:51.84000 14/11/19 18:52:51 INFO Master: Launching executor
app-20141119184815-1313/75 on worker
worker-20141118172622-dn19.chi.shopify.com-38498
2014-11-19_18:52:51.83981 14/11/19 18:52:51 INFO Master: Removing executor
app-20141119184815-1313/67 because it is EXITED
{code}
I couldn't find a corresponding {{Removing app app-20141119184815-1313}} event.
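The launch/EXITED churn is consistent with that: as long as the Master still considers the application registered, every exited executor simply gets replaced with a new one, which in turn cannot reach the dead driver and exits. A simplified model of that loop is below (again a paraphrase, not the actual {{Master.scala}}; the real code also caps retries and can mark the application FAILED, which is omitted here):
{code}
import scala.collection.mutable

// Simplified model of the Master's executor-replacement loop. Paraphrased for
// illustration only, not the actual Master.scala; names are assumptions.
object ExecutorChurnModel {
  case class AppInfo(id: String, var nextExecId: Int = 0)

  // Applications the Master still believes are running.
  val runningApps = mutable.HashMap[String, AppInfo]()

  def launchExecutor(app: AppInfo, worker: String): Unit = {
    println(s"INFO Master: Launching executor ${app.id}/${app.nextExecId} on worker $worker")
    app.nextExecId += 1
  }

  // Roughly what happens when a worker reports that an executor EXITED.
  def onExecutorExited(appId: String, execId: Int, worker: String): Unit = {
    println(s"INFO Master: Removing executor $appId/$execId because it is EXITED")
    // As long as the application was never removed (because its driver's
    // disassociation was missed), the scheduler hands it a fresh executor,
    // which again fails to reach the dead driver and exits -- hence the churn.
    runningApps.get(appId).foreach(app => launchExecutor(app, worker))
  }

  def main(args: Array[String]): Unit = {
    val app = AppInfo("app-20141119184815-1313", nextExecId = 67)
    runningApps(app.id) = app
    launchExecutor(app, "worker-1")          // launches .../67
    onExecutorExited(app.id, 67, "worker-2") // removes 67, launches 68
    onExecutorExited(app.id, 68, "worker-3") // removes 68, launches 69
  }
}
{code}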
Another interesting thing: even though it looks like this log only contains
information for 39 distinct drivers, there are 100 "got disassociated" events:
{code}
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | uniq | wc -l
39
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | wc -l
100
{code}
> Standalone Master can fail to recognize completed/failed applications
> ---------------------------------------------------------------------
>
> Key: SPARK-4498
> URL: https://issues.apache.org/jira/browse/SPARK-4498
> Project: Spark
> Issue Type: Bug
> Components: Deploy, Spark Core
> Affects Versions: 1.1.1, 1.2.0
> Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic
> #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
> - Standalone Spark built from
> apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
> - Python 2.7.3
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
> - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of
> memory apiece
> - All client code is PySpark
> Reporter: Harry Brundage
> Priority: Blocker
> Attachments: all-master-logs-around-blip.txt,
> one-applications-master-logs.txt
>
>
> We observe the Spark standalone master failing to detect that a driver
> application has completed after the driver process has shut down, leaving
> that driver's resources consumed indefinitely. The master reports such
> applications as Running even though the driver processes have long since
> terminated. The master continually spawns one executor for the application:
> it boots, times out trying to connect to the driver application, and then
> dies with the exception below. The master then spawns another executor on a
> different worker, which does the same thing. The application lives on until
> the master (and workers) are restarted.
> This happens to many jobs at once, all right around the same time, two or
> three times a day, and they all get stuck. Before and after this "blip",
> applications start, get resources, finish, and are marked as finished
> properly. The "blip" is mostly conjecture on my part; I have no hard evidence
> that it exists other than the pattern I see in the Running
> Applications table. See
> http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png
> : the applications started before the blip at 1.9 hours ago still have
> active drivers. All the applications started 1.9 hours ago do not, and the
> applications started less than 1.9 hours ago (at the top of the table) do in
> fact have active drivers.
> Deploy mode:
> - PySpark drivers running on one node outside the cluster, scheduled by a
> cron-like application, not master supervised
>
> Other factoids:
> - In most places, we call sc.stop() explicitly before shutting down our
> driver process
> - Here's the sum total of Spark configuration options we set to something
> other than the default:
> {code}
> "spark.cores.max": 30
> "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs"
> "spark.eventLog.enabled": true
> "spark.executor.memory": "7g"
> "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/"
> "spark.io.compression.codec": "lzf"
> "spark.ui.killEnabled": true
> {code}
> - The exception the executors die with is this:
> {code}
> 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal
> handlers for [TERM, HUP, INT]
> 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library
> for your platform... using builtin-java classes where applicable
> 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(spark, azkaban);
> users with modify permissions: Set(spark, azkaban)
> 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
> 14/11/19 19:42:37 INFO Remoting: Starting remoting
> 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://[email protected]:37682]
> 14/11/19 19:42:38 INFO Utils: Successfully started service
> 'driverPropsFetcher' on port 37682.
> 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote
> address [akka.tcp://[email protected]:58849]. Address is
> now gated for 5000 ms, all messages to this address will be delivered to dead
> letters. Reason: Connection refused:
> spark-etl1.chi.shopify.com/172.16.126.88:58849
> 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException
> as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures
> timed out after [30 seconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException:
> Unknown exception in doAs
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException:
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30
> seconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
> ... 7 more
> {code}
> Cluster history:
> - We run Spark versions built from apache/spark#master snapshots. We did not
> observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry
> it's so old), but now observe it on
> {{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to
> assist debugging.