Timeout errors don't usually occur outside of the Mapper.map() 'phase'. When we've seen this error, it has had to do with M/R jobs going against HBase...
Since the OP sees the error when he does a bulk 'write', but it stops when he reduces the number of writes... that kind of suggests where the problem occurs... unless of course I missed something...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 18, 2012, at 9:28 PM, Raj Vishwanthan <[email protected]> wrote:

> You can try the following:
> - make it into a map-only job (for debug purposes)
> - start your shuffle phase after all the maps are complete (there is a parameter for this)
> - characterize your disks for performance
>
> Raj
>
> Sent from Samsung Mobile
>
> Steve Lewis <[email protected]> wrote:
>
> In my hands the problem occurs in all map jobs. An associate with a different cluster - mine has 8 nodes, his 40 - reports that 80% of his map tasks fail, with a few succeeding.
> I suspect some kind of an I/O wait, but fail to see how it gets to 600 sec.
>
> On Wed, Jan 18, 2012 at 4:50 PM, Raj V <[email protected]> wrote:
> Steve,
>
> Does the timeout happen for all the map jobs? Are you using some kind of shared storage for map outputs? Any problems with the physical disks? If the shuffle phase has started, could the disks be I/O waiting between the read and the write?
>
> Raj
>
>> ________________________________
>> From: Steve Lewis <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, January 18, 2012 4:21 PM
>> Subject: Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
>>
>> 1) I do a lot of progress reporting.
>> 2) Why would the job succeed when the only change in the code is
>>
>>     if (NumberWrites++ % 100 == 0)
>>         context.write(key, value);
>>
>> Comment out the test, allowing full writes, and the job fails.
>> Since every write is a report, I assume that something in the write code or other hadoop code for dealing with output is failing.
>> I do increment a counter for every write, or in the case of the above code, potential write.
>> What I am seeing is that wherever the timeout occurs, it is not in a place where I am capable of inserting more reporting.
>>
>> On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <[email protected]> wrote:
>>
>>> Perhaps you are not reporting progress throughout your task. If you happen to run a large enough job, you hit the default timeout mapred.task.timeout (which defaults to 10 min). Perhaps you should consider reporting progress in your mapper/reducer by calling progress() on the Reporter object. Check tip 7 of this link:
>>>
>>> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>>>
>>> Hope that helps,
>>> -Leo
>>>
>>> Sent from my phone
>>>
>>> On Jan 18, 2012, at 6:46 PM, Steve Lewis <[email protected]> wrote:
>>>
>>>> I KNOW it is a task timeout - what I do NOT know is WHY merely cutting the number of writes makes it go away. It seems to imply that some context.write operation, or something downstream from it, is taking a huge amount of time, and that is all hadoop internal code - not mine. So my question is: why should increasing the number and volume of writes cause a task to time out?
>>>>
>>>> On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <[email protected]> wrote:
>>>>
>>>>> Sounds like mapred.task.timeout? The default is 10 minutes.
>>>>>
>>>>> http://hadoop.apache.org/common/docs/current/mapred-default.html
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tom
>>>>>
>>>>> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <[email protected]> wrote:
>>>>>> The map tasks fail, timing out after 600 sec.
>>>>>> I am processing one 9 GB file with 16,000,000 records. Each record (think of it as a line) generates hundreds of key-value pairs.
>>>>>> The job is unusual in that the output of the mapper, in terms of records or bytes, is orders of magnitude larger than the input.
>>>>>> I have no idea what is slowing down the job, except that the problem is in the writes.
>>>>>>
>>>>>> If I change the job to merely bypass a fraction of the context.write statements, the job succeeds.
>>>>>> These are the counters from one map task that failed and one that succeeded - I cannot understand how a write can take so long, or what else the mapper might be doing.
>>>>>>
>>>>>> JOB FAILED WITH TIMEOUT
>>>>>>
>>>>>> Parser:
>>>>>>   TotalProteins           90,103
>>>>>>   NumberFragments         10,933,089
>>>>>> FileSystemCounters:
>>>>>>   HDFS_BYTES_READ         67,245,605
>>>>>>   FILE_BYTES_WRITTEN      444,054,807
>>>>>> Map-Reduce Framework:
>>>>>>   Combine output records  10,033,499
>>>>>>   Map input records       90,103
>>>>>>   Spilled Records         10,032,836
>>>>>>   Map output bytes        3,520,182,794
>>>>>>   Combine input records   10,844,881
>>>>>>   Map output records      10,933,089
>>>>>>
>>>>>> Same code but fewer writes:
>>>>>>
>>>>>> JOB SUCCEEDED
>>>>>>
>>>>>> Parser:
>>>>>>   TotalProteins           90,103
>>>>>>   NumberFragments         206,658,758
>>>>>> FileSystemCounters:
>>>>>>   FILE_BYTES_READ         111,578,253
>>>>>>   HDFS_BYTES_READ         67,245,607
>>>>>>   FILE_BYTES_WRITTEN      220,169,922
>>>>>> Map-Reduce Framework:
>>>>>>   Combine output records  4,046,128
>>>>>>   Map input records       90,103
>>>>>>   Spilled Records         4,046,128
>>>>>>   Map output bytes        662,354,413
>>>>>>   Combine input records   4,098,609
>>>>>>   Map output records      2,066,588
>>>>>>
>>>>>> Any bright ideas?
>>>>>> --
>>>>>> Steven M. Lewis PhD
>>>>>> 4221 105th Ave NE
>>>>>> Kirkland, WA 98033
>>>>>> 206-384-1340 (cell)
>>>>>> Skype lordjoe_com
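[Note for readers of the archive] Tip 7 that Leonardo mentions amounts to reporting progress periodically from inside the record loop, so the tasktracker's timeout clock keeps getting reset even while the task does long stretches of work. A minimal sketch of the pattern follows; ProgressContext here is a hypothetical stand-in interface used only so the snippet compiles without Hadoop on the classpath. In a real job you would call progress() on the actual Mapper.Context (new API) or on the Reporter (old API) instead.

```java
// Sketch of the periodic-progress pattern. ProgressContext is a
// hypothetical stand-in for Hadoop's Mapper.Context / Reporter; in a
// real mapper, call context.progress() to reset the task timeout clock.
public class ProgressSketch {

    interface ProgressContext {
        void progress(); // in Hadoop, resets the 600 sec timeout clock
    }

    // Process numRecords records, reporting progress every `interval`
    // records rather than on every record, so reporting stays cheap.
    // Returns the number of progress calls made.
    static long processRecords(long numRecords, long interval, ProgressContext ctx) {
        long reports = 0;
        for (long i = 0; i < numRecords; i++) {
            // ... per-record work and context.write(key, value) would go here ...
            if (i % interval == 0) {
                ctx.progress();
                reports++;
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        long reports = processRecords(1_000_000, 10_000, () -> { /* no-op stand-in */ });
        System.out.println("progress calls: " + reports); // 100 for 1,000,000 records
    }
}
```

Reporting every N records rather than on every record keeps the overhead negligible while still proving liveness; note it will not help if a single context.write() itself blocks for more than the timeout.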
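[Note for readers of the archive] For completeness, the mapred.task.timeout property Tom points at can also be raised per job or cluster-wide - a stopgap rather than a fix if something in the write path is genuinely stuck. A sketch of the mapred-site.xml fragment, assuming the Hadoop version documented at the mapred-default.html link above:

```xml
<!-- mapred-site.xml: value is in milliseconds; default is 600000 (10 min).
     Setting it to 0 disables the timeout entirely, which hides hangs
     rather than explaining them. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value> <!-- 20 minutes -->
</property>
```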
