[ https://issues.apache.org/jira/browse/MAPREDUCE-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohith moved YARN-3788 to MAPREDUCE-6393:
-----------------------------------------

    Affects Version/s:     (was: 2.4.1)
                           2.4.1
                  Key: MAPREDUCE-6393  (was: YARN-3788)
              Project: Hadoop Map/Reduce  (was: Hadoop YARN)

> Application Master and Task Tracker timeouts are applied incorrectly
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6393
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6393
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Dmitry Sivachenko
>
> I am running a streaming job which requires a big (~50 GB) data file to run (the file is attached via hadoop jar <...> -file BigFile.dat).
> Most likely this command will fail as follows (note that the error message is rather meaningless):
>
> 2015-05-27 15:55:00,754 WARN [main] streaming.StreamJob (StreamJob.java:parseArgv(291)) - -file option is deprecated, please use generic option -files instead.
> packageJobJar: [/ssd/mt/lm/en_reorder.ylm, mapper.py, /tmp/hadoop-mitya/hadoop-unjar3778165585140840383/] [] /var/tmp/streamjob633547925483233845.jar tmpDir=null
> 2015-05-27 19:46:22,942 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at nezabudka1-00.yandex.ru/5.255.231.129:8032
> 2015-05-27 19:46:23,733 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at nezabudka1-00.yandex.ru/5.255.231.129:8032
> 2015-05-27 20:13:37,231 INFO [main] mapred.FileInputFormat (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
> 2015-05-27 20:13:38,110 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
> 2015-05-27 20:13:38,136 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2015-05-27 20:13:38,390 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_1431704916575_2531
> 2015-05-27 20:13:38,689 INFO [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(204)) - Submitted application application_1431704916575_2531
> 2015-05-27 20:13:38,743 INFO [main] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://nezabudka1-00.yandex.ru:8088/proxy/application_1431704916575_2531/
> 2015-05-27 20:13:38,746 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1334)) - Running job: job_1431704916575_2531
> 2015-05-27 21:04:12,353 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Job job_1431704916575_2531 running in uber mode : false
> 2015-05-27 21:04:12,356 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%
> 2015-05-27 21:04:12,374 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1375)) - Job job_1431704916575_2531 failed with state FAILED due to: Application application_1431704916575_2531 failed 2 times due to ApplicationMaster for attempt appattempt_1431704916575_2531_000002 timed out. Failing the application.
> 2015-05-27 21:04:12,473 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Counters: 0
> 2015-05-27 21:04:12,474 ERROR [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1019)) - Job not Successful!
> Streaming Command Failed!
>
> This is because the yarn.am.liveness-monitor.expiry-interval-ms timeout (default: 600 seconds) expires before the large data file has been transferred.
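> Raising this limit means editing yarn-site.xml on the ResourceManager (inside the <configuration> element) and restarting it; a minimal sketch, where the one-hour value is purely illustrative, not a recommendation:
>
>   <property>
>     <!-- How long the RM waits for AM heartbeats before expiring the attempt. -->
>     <!-- Default is 600000 ms (600 s); 3600000 is an illustrative value chosen -->
>     <!-- here only to outlast slow distribution of the large job file. -->
>     <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
>     <value>3600000</value>
>   </property>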
> Next, I increase yarn.am.liveness-monitor.expiry-interval-ms. After that the application initializes successfully and tasks are spawned.
> But then I run into another error: the default 600-second mapreduce.task.timeout expires before the tasks are initialized, and the tasks fail.
> The error message "Task attempt_XXX failed to report status for 600 seconds" is also misleading: this timeout is supposed to kill non-responsive (stuck) tasks, but here it fires because the auxiliary data files are copied slowly.
> So I need to increase mapreduce.task.timeout as well (see the sketch at the end of this description), and only after that does my job succeed.
> At the very least, the error messages should be changed to indicate that the Application (or Task) failed because the auxiliary files could not be copied within that time, rather than a generic "timeout expired".
> A better solution would be to not count the time spent on data file distribution against these timeouts.
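> For reference, the task-side counterpart: it can go in mapred-site.xml as below, or be passed per-job as a generic option (-D mapreduce.task.timeout=3600000, placed before the streaming options). The one-hour value is again only illustrative:
>
>   <property>
>     <!-- Milliseconds a task may run without reading input, writing output, -->
>     <!-- or updating its status before it is killed. Default is 600000 ms;  -->
>     <!-- 3600000 is an illustrative value to cover slow auxiliary-file copy. -->
>     <name>mapreduce.task.timeout</name>
>     <value>3600000</value>
>   </property>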