Hi there,

I am running 30 applications in my Spark cluster, and some of them fail with the exception below:

[root@slave3 0]# cat stderr
15/06/29 17:20:08 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/06/29 17:20:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/29 17:20:09 INFO spark.SecurityManager: Changing view acls to: root
15/06/29 17:20:09 INFO spark.SecurityManager: Changing modify acls to: root
15/06/29 17:20:09 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/06/29 17:20:09 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/06/29 17:20:09 INFO Remoting: Starting remoting
15/06/29 17:20:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@slave3:51026]
15/06/29 17:20:10 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 51026.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:128)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:224)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:144)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    ... 4 more
When I run only 20 applications, everything works fine. So I suspect the executors are getting disassociated from the driver due to high I/O pressure or network latency; however, I have no idea which Spark parameter could fix this. Any idea will be appreciated.

Here is some information about my cluster: 1 master and 6 workers; every node has 8 cores and 12 GB of memory. The settings in my spark-defaults.conf and spark-env.sh are as follows:

spark-defaults.conf:
spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir /var/log/spark
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 8g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.kryoserializer.buffer.max.mb 128
spark.storage.memoryFraction 0.2
spark.shuffle.memoryFraction 0.4
spark.sql.shuffle.partitions 32
spark.scheduler.mode FAIR
spark.worker.cleanup.appDataTtl 259200
spark.port.maxRetries 10000
spark.scheduler.maxRegisteredResourcesWaitingTime 40

spark-env.sh:
export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_INSTANCES=8
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=1g

--------------------------------
Thanks & Best regards!
San.Luo
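PS: based on the "Futures timed out after [30 seconds]" message, the only candidates I have found so far in the Spark 1.x configuration docs are the network/RPC timeouts. This is a sketch of what I am considering adding to spark-defaults.conf (untested; I am not sure these are the right knobs for this particular timeout):

```
# Umbrella network timeout (default 120s); it also serves as the
# default for several more specific control-plane timeouts.
spark.network.timeout    300s

# Akka-based control-plane timeout (Spark 1.x only, default 100 seconds).
spark.akka.timeout       300
```

Is raising these the right direction, or is this 30-second timeout governed by something else?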