Hi Ross,

This is most likely occurring because YARN is killing containers for
exceeding physical memory limits. You can make this less likely to happen
by bumping spark.yarn.executor.memoryOverhead above its default of 10% of
your spark.executor.memory.
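
For example, if you are launching with spark-submit, something along these
lines should work (the 4g / 1024 values are only placeholders, so size them
for your own executors; the overhead value is in MB):

  spark-submit --master yarn \
    --conf spark.executor.memory=4g \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    <your app jar and arguments>

You can also put the same settings in conf/spark-defaults.conf instead of
passing them on the command line.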

-Sandy

On Thu, Nov 19, 2015 at 8:14 AM, <ross.cramb...@thomsonreuters.com> wrote:

> Hmm I guess I do not - I get 'application_1445957755572_0176 does not
> have any log files.' Where can I enable log aggregation?
>
> On Nov 19, 2015, at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Do you have YARN log aggregation enabled ?
>
> You can try retrieving log for the container using the following command:
>
> yarn logs -applicationId application_1445957755572_0176 \
>   -containerId container_1445957755572_0176_01_000003
>
> Cheers
>
> On Thu, Nov 19, 2015 at 8:02 AM, <ross.cramb...@thomsonreuters.com> wrote:
>
>> I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
>> transforms on a JSON data set that I load into a data frame. The data set
>> is not large (~100GB) and most stages execute without any issues. However,
>> some more complex stages tend to lose executors/nodes regularly. What would
>> cause this to happen? The logs don’t give too much information -
>>
>> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
>> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
>> container_1445957755572_0176_01_000003)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
>> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
>> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
>> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
>> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
>> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
>> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
>> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
>> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
>> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
>> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> [Stage 33:===============================>                     (117 + 50) / 200]
>> 15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
>> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
>> has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>>
>>  - Followed by a list of lost tasks on each executor.
