Thank you, Ted and Sandy, for pointing me in the right direction. From the 
logs:

WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 
25.4 GB of 25.3 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.
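
A minimal sketch of bumping the overhead at submit time, assuming the job is 
launched with spark-submit (the 2048 below is just an illustrative value in 
MB, not a figure from this thread):

    spark-submit \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      ... <rest of the usual arguments> ...

In Spark 1.5 the default overhead is max(384 MB, 10% of the executor memory), 
so the same property can also go in spark-defaults.conf or on a SparkConf set 
before the context is created.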


On Nov 19, 2015, at 12:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Here are the parameters related to log aggregation:

    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
    </property>

    <property>
      <name>yarn.log-aggregation.retain-seconds</name>
      <value>2592000</value>
    </property>
    <property>
      <name>yarn.nodemanager.log-aggregation.compression-type</name>
      <value>gz</value>
    </property>

    <property>
      <name>yarn.nodemanager.log-aggregation.debug-enabled</name>
      <value>false</value>
    </property>

    <property>
      <name>yarn.nodemanager.log-aggregation.num-log-files-per-app</name>
      <value>30</value>
    </property>

    <property>
      <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
      <value>-1</value>
    </property>
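
(For completeness: these properties typically live in yarn-site.xml on each 
node in a standard Hadoop setup, and the ResourceManager and NodeManagers 
generally need a restart before log aggregation takes effect for newly 
submitted applications.)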

On Thu, Nov 19, 2015 at 8:14 AM, <ross.cramb...@thomsonreuters.com> wrote:
Hmm, I guess I do not - I get 'application_1445957755572_0176 does not have 
any log files.' Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Do you have YARN log aggregation enabled?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176 -containerId 
container_1445957755572_0176_01_000003

Cheers

On Thu, Nov 19, 2015 at 8:02 AM, <ross.cramb...@thomsonreuters.com> wrote:
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information -

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on 
ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container 
container_1445957755572_0176_01_000003)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===============================>                     (117 + 50) / 200]
15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.



