Actually, the HTTP 400 is a red herring and not the core issue. I added "-D mapreduce.client.output.filter=ALL" to the command line, and fetching the task output fails even for successful tasks:

12/07/17 19:15:55 INFO mapreduce.Job: Task Id : attempt_1342570404456_0001_m_000006_1, Status : SUCCEEDED
12/07/17 19:15:55 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342570404456_0001_m_000006_1&filter=stdout
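For completeness, the invocation is along these lines; the examples jar name, row count, and output path below are placeholders rather than my exact values:

    # TeraGen run with the client output filter set to ALL, so the client
    # tries to fetch task logs even for successful attempts
    hadoop jar hadoop-mapreduce-examples.jar teragen \
        -D mapreduce.client.output.filter=ALL \
        10000000 /tmp/teragen-out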
Having a better idea what to search for, I found that it's a recently fixed bug:
https://issues.apache.org/jira/browse/MAPREDUCE-3889

So the real question is how can I debug the failing tasks on the non-AM slave(s)? Although I see failure on the client:

12/07/17 19:14:35 INFO mapreduce.Job: Task Id : attempt_1342570404456_0001_m_000002_0, Status : FAILED

I see what appears to be success on the slave:

2012-07-17 19:13:47,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1342570404456_0001_01_000002 succeeded
2012-07-17 19:13:47,477 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1342570404456_0001_01_000002 transitioned from RUNNING to EXITED_WITH_SUCCESS

Suggestions of where to look next?
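In case it helps narrow down an answer: I'm assuming the per-task stdout/stderr on each slave end up under the NodeManager's local log directories, in a layout roughly like the sketch below (the layout is my assumption, not copied from a node). If that's not the right place to be looking, that may well be my problem.

    ${yarn.nodemanager.log-dirs}/application_1342570404456_0001/
        container_1342570404456_0001_01_000002/
            stdout
            stderr
            syslog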
Thanks,
Trevor

On Tue, Jul 17, 2012 at 6:33 PM, Trevor <tre...@scurrilous.com> wrote:
> Arun, I just verified that I get the same error with 2.0.0-alpha (official
> tarball) and 2.0.1-alpha (built from svn).
>
> Karthik, thanks for forwarding.
>
> Thanks,
> Trevor
>
>
> On Tue, Jul 17, 2012 at 6:18 PM, Karthik Kambatla <ka...@cloudera.com> wrote:
>
>> Forwarding your email to the cdh-user group.
>>
>> Thanks
>> Karthik
>>
>>
>> On Tue, Jul 17, 2012 at 2:24 PM, Trevor <tre...@scurrilous.com> wrote:
>>
>>> Hi all,
>>>
>>> I recently upgraded from CDH4b2 (0.23.1) to CDH4 (2.0.0). Now for some
>>> strange reason, my MRv2 jobs (TeraGen, specifically) fail if I run with
>>> more than one slave. For every slave except the one running the Application
>>> Master, I get the following failed tasks and warnings repeatedly:
>>>
>>> 12/07/13 14:21:55 INFO mapreduce.Job: Running job: job_1342207265272_0001
>>> 12/07/13 14:22:17 INFO mapreduce.Job: Job job_1342207265272_0001 running in uber mode : false
>>> 12/07/13 14:22:17 INFO mapreduce.Job: map 0% reduce 0%
>>> 12/07/13 14:22:46 INFO mapreduce.Job: map 1% reduce 0%
>>> 12/07/13 14:22:52 INFO mapreduce.Job: map 2% reduce 0%
>>> 12/07/13 14:22:55 INFO mapreduce.Job: map 3% reduce 0%
>>> 12/07/13 14:22:58 INFO mapreduce.Job: map 4% reduce 0%
>>> 12/07/13 14:23:04 INFO mapreduce.Job: map 5% reduce 0%
>>> 12/07/13 14:23:07 INFO mapreduce.Job: map 6% reduce 0%
>>> 12/07/13 14:23:07 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000004_0, Status : FAILED
>>> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stdout
>>> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stderr
>>> 12/07/13 14:23:08 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000003_0, Status : FAILED
>>> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000003_0&filter=stdout
>>> ...
>>> 12/07/13 14:25:12 INFO mapreduce.Job: map 25% reduce 0%
>>> 12/07/13 14:25:12 INFO mapreduce.Job: Job job_1342207265272_0001 failed with state FAILED due to:
>>> ...
>>> Failed map tasks=19
>>> Launched map tasks=31
>>>
>>> The HTTP 400 error appears to be generated by the ShuffleHandler, which
>>> is configured to run on port 8080 of the slaves, and doesn't understand
>>> that URL. What I've been able to piece together so far is that /tasklog is
>>> handled by the TaskLogServlet, which is part of the TaskTracker. However,
>>> isn't this an MRv1 class that shouldn't even be running in my
>>> configuration? Also, the TaskTracker appears to run on port 50060, so I
>>> don't know where port 8080 is coming from.
>>>
>>> Though it could be a red herring, this warning seems to be related to
>>> the job failing, despite the fact that the job makes progress on the slave
>>> running the AM. The Node Manager logs on both AM and non-AM slaves appear
>>> fairly similar, and I don't see any errors in the non-AM logs.
>>>
>>> Another strange data point: These failures occur running the slaves on
>>> ARM systems. Running the slaves on x86 with the same configuration works.
>>> I'm using the same tarball on both, which means that the native-hadoop
>>> library isn't loaded on ARM. The master/client is the same x86 system in
>>> both scenarios. All nodes are running Ubuntu 12.04.
>>>
>>> Thanks for any guidance,
>>> Trevor
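P.S. For what it's worth, here is how I understand the shuffle side is wired up on the slaves; this is a sketch of the usual yarn-site.xml entries as I understand them, not a paste from my config. It would at least explain why port 8080 answers the /tasklog requests with a 400: that port belongs to the ShuffleHandler, not to any task-log servlet.

    <!-- Run the MRv2 shuffle service inside each NodeManager -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce.shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <!-- The ShuffleHandler's HTTP port; this is the 8080 the failing /tasklog URLs are hitting -->
    <property>
      <name>mapreduce.shuffle.port</name>
      <value>8080</value>
    </property>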