Here are the parameters related to log aggregation:

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>2592000</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>gz</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.debug-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.num-log-files-per-app</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
    <value>-1</value>
  </property>
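These properties belong in yarn-site.xml, and the NodeManagers generally need a restart before aggregation takes effect for new applications. Once enabled, all container logs for the application can also be fetched in one go (a variant of the per-container command quoted further down the thread):

    yarn logs -applicationId application_1445957755572_0176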
On Thu, Nov 19, 2015 at 8:14 AM, <ross.cramb...@thomsonreuters.com> wrote:

> Hmm, I guess I do not - I get 'application_1445957755572_0176 does not
> have any log files.' Where can I enable log aggregation?
>
> On Nov 19, 2015, at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Do you have YARN log aggregation enabled?
>
> You can try retrieving the log for the container using the following command:
>
> yarn logs -applicationId application_1445957755572_0176 \
>     -containerId container_1445957755572_0176_01_000003
>
> Cheers
>
> On Thu, Nov 19, 2015 at 8:02 AM, <ross.cramb...@thomsonreuters.com> wrote:
>
>> I am running Spark 1.5.2 on YARN. My job consists of a number of SparkSQL
>> transforms on a JSON data set that I load into a data frame. The data set
>> is not large (~100 GB) and most stages execute without any issues. However,
>> some of the more complex stages tend to lose executors/nodes regularly.
>> What would cause this to happen? The logs don't give too much information:
>>
>> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
>> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
>> container_1445957755572_0176_01_000003)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
>> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
>> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
>> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
>> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
>> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
>> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
>> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
>> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
>> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
>> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> [Stage 33:===============================>          (117 + 50) / 200]
>> 15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
>> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
>> has failed, address is now gated for [5000] ms.
>> Reason: [Disassociated]
>>
>> - Followed by a list of lost tasks on each executor.
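A frequent cause of "Yarn deallocated the executor" on memory-heavy SparkSQL stages is YARN killing the container for exceeding its memory limit; if that is what happened here, the aggregated container log will say so with a "running beyond physical memory limits" message. Assuming that diagnosis holds (the thread itself never confirms it), one mitigation sketch is to raise the off-heap headroom YARN reserves per executor, which in Spark 1.5 is controlled by spark.yarn.executor.memoryOverhead (in MB):

    # Sketch only - the jar, class name, and sizes are placeholders,
    # not taken from this thread; tune them to the actual workload.
    spark-submit \
      --master yarn-cluster \
      --executor-memory 8g \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      --class com.example.TransformJob transform-job.jar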