Since yarn-site.xml was cited, I assume the cluster runs YARN.
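
Exit code 1 from container-launch is generic; the real cause is usually in the
failed container's stderr. Assuming log aggregation is enabled, something like
this (using the application id visible in your error message) should show it:

        yarn logs -applicationId application_1463692924309_0002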

On Fri, May 20, 2016 at 12:30 PM, Rodrick Brown <rodr...@orchardplatform.com
> wrote:

> Is this YARN or Mesos? For the latter you need to start an external shuffle
> service.
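>
> For the Mesos case, as a rough sketch (assuming the sbin scripts bundled
> with a Spark 1.6 binary distribution), that means running the shuffle
> service on every slave and enabling it in spark-defaults.conf:
>
>         $SPARK_HOME/sbin/start-mesos-shuffle-service.sh
>
>         spark.shuffle.service.enabled      true
>         spark.dynamicAllocation.enabled    true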
>
> On Fri, May 20, 2016 at 11:48 AM -0700, "Cui, Weifeng" <weife...@a9.com>
> wrote:
>
>> Hi guys,
>>
>>
>>
>> Our team has a Hadoop 2.6.0 cluster with Spark 1.6.1. We wanted to enable
>> dynamic resource allocation for Spark and followed the link below. After
>> the changes, all Spark jobs failed.
>>
>>
>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>> The test ran on a test cluster with 1 master machine (running the
>> namenode, resourcemanager and hive server), 1 worker machine (running the
>> datanode and nodemanager), and 1 client machine (running the spark shell).
>>
>>
>>
>> *What I updated in the config:*
>>
>>
>>
>> 1. Update spark-defaults.conf:
>>
>>         spark.dynamicAllocation.enabled    true
>>         spark.shuffle.service.enabled      true
>>
>>
>>
>> 2. Update yarn-site.xml:
>>
>>         <property>
>>             <name>yarn.nodemanager.aux-services</name>
>>             <value>mapreduce_shuffle,spark_shuffle</value>
>>         </property>
>>
>>         <property>
>>             <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>>             <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>>         </property>
>>
>>         <property>
>>             <name>spark.shuffle.service.enabled</name>
>>             <value>true</value>
>>         </property>
>>
>> 3. Copy spark-1.6.1-yarn-shuffle.jar onto yarn.application.classpath
>> ($HADOOP_HOME/share/hadoop/yarn/*); this copy is done by our Python code.
>>
>> 4. Restart everything: namenode, datanode, resourcemanager, nodemanager...
>>
>> 5. The config is updated on all machines (resourcemanager and
>> nodemanager): we update it in one place and copy it to every machine.
>> (A quick way to verify the shuffle service after the restart is noted
>> below.)
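>>
>> For reference, one sanity check after restarting the nodemanager (a
>> sketch, assuming the default spark.shuffle.service.port of 7337 and the
>> standard Hadoop log directory) is to confirm the spark_shuffle aux
>> service actually started:
>>
>>         # the external shuffle service listens on 7337 by default
>>         netstat -tlnp | grep 7337
>>
>>         # the YarnShuffleService class should show up in the nodemanager log
>>         grep -i YarnShuffleService $HADOOP_HOME/logs/*nodemanager*.log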
>>
>>
>>
>> *What I tested:*
>>
>>
>>
>> 1. I started a Scala spark-shell and checked its environment;
>> spark.dynamicAllocation.enabled is true.
>>
>> 2. I used the following code:
>>
>>         scala > val line =
>> sc.textFile("/spark-events/application_1463681113470_0006")
>>
>>                     line: org.apache.spark.rdd.RDD[String] =
>> /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at
>> textFile at <console>:27
>>
>>         scala > line.count    // this command just hangs here
>>
>>
>>
>> 3. In the beginning there was only 1 executor (the driver), and after
>> line.count I could see 3 executors, which then dropped back to 1 (a
>> shell-side way to watch this is noted after step 4).
>>
>> 4. Several jobs were launched and all of them failed. Tasks (for all
>> stages): Succeeded/Total: 0/2 (4 failed).
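>>
>> (For reference, the executor count can also be watched from the shell
>> itself; sc.getExecutorMemoryStatus reports one block manager per executor
>> plus one for the driver, so the expected value here is executors + 1.)
>>
>>         scala > sc.getExecutorMemoryStatus.size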
>>
>>
>>
>> *Error messages:*
>>
>>
>>
>> I found the following message in the Spark web UI, and the same error in
>> spark.log on the nodemanager machine.
>>
>>
>>
>> ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
>> Reason: Container marked as failed:
>> container_1463692924309_0002_01_000002 on host: xxxxxxxxxxxxxxx.com.
>> Exit status: 1. Diagnostics: Exception from container-launch.
>> Container id: container_1463692924309_0002_01_000002
>> Exit code: 1
>> Stack trace: ExitCodeException exitCode=1:
>>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> Container exited with a non-zero exit code 1
>>
>>
>>
>> Thanks a lot for the help. We can provide more information if needed.
>>
>>
>>
>> Thanks,
>> Weifeng
>>
