Hi guys,


Our team runs a Hadoop 2.6.0 cluster with Spark 1.6.1. We want to enable dynamic
resource allocation for Spark, and we followed the link below. After the
changes, all Spark jobs fail.
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

The test ran on a test cluster with one master machine (running the NameNode,
ResourceManager, and Hive server), one worker machine (running the DataNode and
NodeManager), and one client machine (running spark-shell).



What I updated in the configuration:



1. Updated spark-defaults.conf:

        spark.dynamicAllocation.enabled    true
        spark.shuffle.service.enabled      true

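We left the other dynamic-allocation knobs at their defaults; for reference, the
Spark 1.6 defaults look like this (illustrative only, not part of our change):

        spark.dynamicAllocation.minExecutors            0
        spark.dynamicAllocation.maxExecutors            2147483647
        spark.dynamicAllocation.executorIdleTimeout     60s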


2. Updated yarn-site.xml:

        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle,spark_shuffle</value>
        </property>

        <property>
            <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
            <value>org.apache.spark.network.yarn.YarnShuffleService</value>
        </property>

        <property>
            <name>spark.shuffle.service.enabled</name>
            <value>true</value>
        </property>
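
After the restart in step 4, we sanity-checked this step like so (log path and
log wording may vary by install):

        # hypothetical log location; adjust to your layout
        grep -i YarnShuffleService $HADOOP_HOME/logs/yarn-*-nodemanager-*.log
        # the external shuffle service listens on port 7337 by default
        netstat -tlnp | grep 7337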

3. Copied spark-1.6.1-yarn-shuffle.jar into a directory on
yarn.application.classpath ($HADOOP_HOME/share/hadoop/yarn/*), as sketched below.
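
For the record, the copy is roughly this (paths illustrative; the 1.6.x tarball
ships the jar under lib/):

        cp $SPARK_HOME/lib/spark-1.6.1-yarn-shuffle.jar \
           $HADOOP_HOME/share/hadoop/yarn/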

4. Restarted everything: NameNode, DataNode, ResourceManager, NodeManager, and so on.

5. The configuration is identical on all machines (ResourceManager and
NodeManager): we update it in one place and copy it to every machine, roughly as
below.
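
Roughly (host names illustrative, assuming the same paths on every machine):

        for h in master worker; do
            scp $HADOOP_HOME/etc/hadoop/yarn-site.xml $h:$HADOOP_HOME/etc/hadoop/
        done
        $HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/start-yarn.sh
        $HADOOP_HOME/sbin/stop-dfs.sh  && $HADOOP_HOME/sbin/start-dfs.sh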



What I tested:



1. I started a Scala spark-shell and checked its configuration:
spark.dynamicAllocation.enabled is true.
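
Concretely, I checked it in the shell like this:

        scala> sc.getConf.get("spark.dynamicAllocation.enabled")
        res0: String = true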

2. I used the following code:

        scala> val line = sc.textFile("/spark-events/application_1463681113470_0006")
        line: org.apache.spark.rdd.RDD[String] = /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at textFile at <console>:27

        scala> line.count   // this command just hangs here



3. At the start, the Executors tab showed only one entry (the driver). After
line.count, I could see 3 executors; the count then dropped back to 1.
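
For reference, the count can also be read from the shell (the returned map
includes the driver entry):

        scala> sc.getExecutorMemoryStatus.size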

4. Several jobs were launched and all of them failed. Tasks (for all stages):
Succeeded/Total: 0/2 (4 failed).



Error messages:



I found the following message in the Spark web UI, and the same appears in
spark.log on the NodeManager machine.


ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
Reason: Container marked as failed: container_1463692924309_0002_01_000002 on
host: xxxxxxxxxxxxxxx.com. Exit status: 1. Diagnostics: Exception from
container-launch.
Container id: container_1463692924309_0002_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
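
If it helps, we can also pull the full container logs via YARN log aggregation
(assuming yarn.log-aggregation-enable is set to true):

        yarn logs -applicationId application_1463692924309_0002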



Thanks a lot for any help. We can provide more information if needed.



Thanks,
Weifeng