[jira] [Updated] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

Harry Brundage (JIRA) Wed, 19 Nov 2014 12:50:53 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Harry Brundage updated SPARK-4498:
----------------------------------
    Description: 
We observe the spark standalone master not detecting that a driver application 
has completed after the driver process has shut down indefinitely, leaving that 
driver's resources consumed indefinitely. The master reports applications as 
Running, but the driver process has long since terminated. The master 
continually spawns one executor for the application. It boots, times out trying 
to connect to the driver application, and then dies with the exception below. 
The master then spawns another executor on a different worker, which does the 
same thing. The application lives until the master (and workers) are restarted. 

This happens to many jobs at once, all right around the same time, two or three 
times a day, where they all get suck. Before and after this "blip" applications 
start, get resources, finish, and are marked as finished properly. The "blip" 
is mostly conjecture on my part, I have no hard evidence that it exists other 
than my identification of the pattern in the Running Applications table. See 
http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png
 : the applications started before the blip at 1.9 hours ago still have active 
drivers. All the applications started 1.9 hours ago do not, and the 
applications started less than 1.9 hours ago (at the top of the table) do in 
fact have active drivers.


Deploy mode:
 - PySpark drivers running on one node outside the cluster, scheduled by a 
cron-like application, not master supervised
 

Other factoids:
 - In most places, we call sc.stop() explicitly before shutting down our driver 
process
 - Here's the sum total of spark configuration options we don't set to the 
default:
{code}
    "spark.cores.max": 30
    "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs"
    "spark.eventLog.enabled": true
    "spark.executor.memory": "7g"
    "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/"
    "spark.io.compression.codec": "lzf"
    "spark.ui.killEnabled": true
{code}
 - The exception the executors die with is this:
{code}
14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers 
for [TERM, HUP, INT]
14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); 
users with modify permissions: Set(spark, azkaban)
14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
14/11/19 19:42:37 INFO Remoting: Starting remoting
14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://[email protected]:37682]
14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' 
on port 37682.
14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote 
address [akka.tcp://[email protected]:58849]. Address is 
now gated for 5000 ms, all messages to this address will be delivered to dead 
letters. Reason: Connection refused: 
spark-etl1.chi.shopify.com/172.16.126.88:58849
14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException 
as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures 
timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: 
Unknown exception in doAs
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
        at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: 
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        ... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
        ... 7 more
{code}


Cluster history:
 - We run spark versions built from apache/spark#master snapshots. We did not 
observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry 
its so old), but now observe it on 
{{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to assist 
debugging.


  was:
We observe the spark standalone master not detecting that a driver application 
has completed after the driver process has shut down indefinitely, leaving that 
driver's resources consumed indefinitely. The master reports applications as 
Running, but the driver process has long since terminated. The master 
continually spawns one executor for the application. It boots, times out trying 
to connect to the driver application, and then dies with the exception below. 
The master then spawns another executor on a different worker, which does the 
same thing. The application lives until the master (and workers) are restarted. 

This happens to many jobs at once, all right around the same time, two or three 
times a day, where they all get suck. Before and after this "blip" applications 
start, get resources, finish, and are marked as finished properly. The "blip" 
is mostly conjecture on my part, I have no hard evidence that it exists other 
than my identification of the pattern in the Running Applications table. See 
http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png
 : the applications started before the blip at 1.9 hours ago still have active 
drivers. All the applications started 1.9 hours ago do not, and the 
applications started less than 1.9 hours ago (at the top of the table) do in 
fact have active drivers.


Deploy mode:
 - PySpark drivers running on one node outside the cluster, scheduled by a 
cron-like application, not master supervised
 

Other factoids:
 - In most places, we call sc.stop() explicitly before shutting down our driver 
process
 - Here's the sum total of spark configuration options we don't set to the 
default:
{code}
    "spark.cores.max": 30
    "spark.eventLog.dir": 
"hdfs://nn01.chi.shopify.com:8020/var/spark/event-logs"
    "spark.eventLog.enabled": true
    "spark.executor.memory": "7g"
    "spark.hadoop.fs.defaultFS": "hdfs://nn01.chi.shopify.com:8020/"
    "spark.io.compression.codec": "lzf"
    "spark.ui.killEnabled": true
{code}
 - The exception the executors die with is this:
{code}
14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers 
for [TERM, HUP, INT]
14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); 
users with modify permissions: Set(spark, azkaban)
14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
14/11/19 19:42:37 INFO Remoting: Starting remoting
14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://[email protected]:37682]
14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' 
on port 37682.
14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote 
address [akka.tcp://[email protected]:58849]. Address is 
now gated for 5000 ms, all messages to this address will be delivered to dead 
letters. Reason: Connection refused: 
spark-etl1.chi.shopify.com/172.16.126.88:58849
14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException 
as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures 
timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: 
Unknown exception in doAs
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
        at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: 
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        ... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
        ... 7 more
{code}


Cluster history:
 - We run spark versions built from apache/spark#master snapshots. We did not 
observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry 
its so old), but now observe it on 
{{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to assist 
debugging.



> Standalone Master can fail to recognize completed/failed applications
> ---------------------------------------------------------------------
>
>                 Key: SPARK-4498
>                 URL: https://issues.apache.org/jira/browse/SPARK-4498
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>         Environment:  - Linux dn11.chi.shopify.com 3.2.0-57-generic 
> #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
>  - Standalone Spark built from 
> apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
>  - Python 2.7.3
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
>  - 1 Spark master, 40 Spark workers with 32 cores a piece and 60-90 GB of 
> memory a piece
>  - All client code is PySpark
>            Reporter: Harry Brundage
>         Attachments: all-master-logs-around-blip.txt, 
> one-applications-master-logs.txt
>
>
> We observe the spark standalone master not detecting that a driver 
> application has completed after the driver process has shut down 
> indefinitely, leaving that driver's resources consumed indefinitely. The 
> master reports applications as Running, but the driver process has long since 
> terminated. The master continually spawns one executor for the application. 
> It boots, times out trying to connect to the driver application, and then 
> dies with the exception below. The master then spawns another executor on a 
> different worker, which does the same thing. The application lives until the 
> master (and workers) are restarted. 
> This happens to many jobs at once, all right around the same time, two or 
> three times a day, where they all get suck. Before and after this "blip" 
> applications start, get resources, finish, and are marked as finished 
> properly. The "blip" is mostly conjecture on my part, I have no hard evidence 
> that it exists other than my identification of the pattern in the Running 
> Applications table. See 
> http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png
>  : the applications started before the blip at 1.9 hours ago still have 
> active drivers. All the applications started 1.9 hours ago do not, and the 
> applications started less than 1.9 hours ago (at the top of the table) do in 
> fact have active drivers.
> Deploy mode:
>  - PySpark drivers running on one node outside the cluster, scheduled by a 
> cron-like application, not master supervised
>  
> Other factoids:
>  - In most places, we call sc.stop() explicitly before shutting down our 
> driver process
>  - Here's the sum total of spark configuration options we don't set to the 
> default:
> {code}
>     "spark.cores.max": 30
>     "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs"
>     "spark.eventLog.enabled": true
>     "spark.executor.memory": "7g"
>     "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/"
>     "spark.io.compression.codec": "lzf"
>     "spark.ui.killEnabled": true
> {code}
>  - The exception the executors die with is this:
> {code}
> 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal 
> handlers for [TERM, HUP, INT]
> 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); 
> users with modify permissions: Set(spark, azkaban)
> 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
> 14/11/19 19:42:37 INFO Remoting: Starting remoting
> 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://[email protected]:37682]
> 14/11/19 19:42:38 INFO Utils: Successfully started service 
> 'driverPropsFetcher' on port 37682.
> 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote 
> address [akka.tcp://[email protected]:58849]. Address is 
> now gated for 5000 ms, all messages to this address will be delivered to dead 
> letters. Reason: Connection refused: 
> spark-etl1.chi.shopify.com/172.16.126.88:58849
> 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException 
> as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures 
> timed out after [30 seconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: 
> Unknown exception in doAs
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
>       at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>       at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
>       at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
>       at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException: 
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>       ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
> seconds]
>       at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>       at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>       at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>       at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>       at scala.concurrent.Await$.result(package.scala:107)
>       at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
>       at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>       at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>       ... 7 more
> {code}
> Cluster history:
>  - We run spark versions built from apache/spark#master snapshots. We did not 
> observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry 
> its so old), but now observe it on 
> {{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to 
> assist debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

Reply via email to