Timeout errors don't usually occur outside of the Mapper.map() 'phase'. When we've seen this error, it has had to do with M/R jobs going against HBase...
Since the OP sees the error when he does a bulk 'write', but it stops when he reduces the number of writes... that kind of suggests where the problem occurs... unless of course I missed something...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 18, 2012, at 9:28 PM, Raj Vishwanthan <[email protected]> wrote:

> You can try the following:
> - make it into a map-only job (for debug purposes)
> - start your shuffle phase after all the maps are complete (there is a parameter for this)
> - characterize your disks for performance
>
> Raj
>
> Sent from Samsung Mobile
>
> Steve Lewis <[email protected]> wrote:
>
> In my hands the problem occurs in all map jobs. An associate with a different cluster - mine has 8 nodes, his 40 - reports that 80% of his map tasks fail, with a few succeeding.
> I suspect some kind of an I/O wait, but fail to see how it gets to 600 sec.
>
> On Wed, Jan 18, 2012 at 4:50 PM, Raj V <[email protected]> wrote:
> Steve,
>
> Does the timeout happen for all the map jobs? Are you using some kind of shared storage for map outputs? Any problems with the physical disks? If the shuffle phase has started, could the disks be I/O waiting between the read and the write?
>
> Raj
>
>> ________________________________
>> From: Steve Lewis <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, January 18, 2012 4:21 PM
>> Subject: Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
>>
>> 1) I do a lot of progress reporting.
>> 2) Why would the job succeed when the only change in the code is
>>
>>     if (NumberWrites++ % 100 == 0)
>>         context.write(key, value);
>>
>> Comment out the test, allowing full writes, and the job fails.
>> Since every write is a report, I assume that something in the write code or other hadoop code for dealing with output is failing.
>> I do increment a counter for every write, or in the case of the above code, potential write.
>> What I am seeing is that wherever the timeout occurs, it is not in a place where I am capable of inserting more reporting.
>>
>> On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <[email protected]> wrote:
>>
>>> Perhaps you are not reporting progress throughout your task. If you happen to run a large enough job, you hit the default timeout mapred.task.timeout (which defaults to 10 min). Perhaps you should consider reporting progress in your mapper/reducer by calling progress() on the Reporter object. Check tip 7 of this link:
>>>
>>> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>>>
>>> Hope that helps,
>>> -Leo
>>>
>>> Sent from my phone
>>>
>>> On Jan 18, 2012, at 6:46 PM, Steve Lewis <[email protected]> wrote:
>>>
>>>> I KNOW it is a task timeout - what I do NOT know is WHY merely cutting the number of writes makes it go away. It seems to imply that some context.write operation, or something downstream from it, is taking a huge amount of time, and that is all hadoop internal code - not mine. So my question is: why should increasing the number and volume of writes cause a task to time out?
>>>>
>>>> On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <[email protected]> wrote:
>>>>
>>>>> Sounds like mapred.task.timeout? The default is 10 minutes.
>>>>>
>>>>> http://hadoop.apache.org/common/docs/current/mapred-default.html
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tom
>>>>>
>>>>> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <[email protected]> wrote:
>>>>>> The map tasks fail, timing out after 600 sec.
>>>>>> I am processing one 9 GB file with 16,000,000 records. Each record (think of it as a line) generates hundreds of key-value pairs.
>>>>>> The job is unusual in that the output of the mapper, in terms of records or bytes, is orders of magnitude larger than the input.
>>>>>> I have no idea what is slowing down the job, except that the problem is in the writes.
>>>>>>
>>>>>> If I change the job to merely bypass a fraction of the context.write statements, the job succeeds.
>>>>>> These are the counters from one map task that failed and one that succeeded - I cannot understand how a write can take so long, or what else the mapper might be doing.
>>>>>>
>>>>>> JOB FAILED WITH TIMEOUT
>>>>>>
>>>>>> Parser:
>>>>>>   TotalProteins           90,103
>>>>>>   NumberFragments         10,933,089
>>>>>> FileSystemCounters:
>>>>>>   HDFS_BYTES_READ         67,245,605
>>>>>>   FILE_BYTES_WRITTEN      444,054,807
>>>>>> Map-Reduce Framework:
>>>>>>   Combine output records  10,033,499
>>>>>>   Map input records       90,103
>>>>>>   Spilled Records         10,032,836
>>>>>>   Map output bytes        3,520,182,794
>>>>>>   Combine input records   10,844,881
>>>>>>   Map output records      10,933,089
>>>>>>
>>>>>> Same code but fewer writes:
>>>>>>
>>>>>> JOB SUCCEEDED
>>>>>>
>>>>>> Parser:
>>>>>>   TotalProteins           90,103
>>>>>>   NumberFragments         206,658,758
>>>>>> FileSystemCounters:
>>>>>>   FILE_BYTES_READ         111,578,253
>>>>>>   HDFS_BYTES_READ         67,245,607
>>>>>>   FILE_BYTES_WRITTEN      220,169,922
>>>>>> Map-Reduce Framework:
>>>>>>   Combine output records  4,046,128
>>>>>>   Map input records       90,103
>>>>>>   Spilled Records         4,046,128
>>>>>>   Map output bytes        662,354,413
>>>>>>   Combine input records   4,098,609
>>>>>>   Map output records      2,066,588
>>>>>>
>>>>>> Any bright ideas?
>>>>>> --
>>>>>> Steven M. Lewis PhD
>>>>>> 4221 105th Ave NE
>>>>>> Kirkland, WA 98033
>>>>>> 206-384-1340 (cell)
>>>>>> Skype lordjoe_com
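[Note for readers of the archive] Tip 7 that Leonardo mentions amounts to reporting progress periodically from inside the record loop, so the tasktracker's timeout clock keeps getting reset even while the task does long stretches of work. A minimal sketch of the pattern follows; ProgressContext here is a hypothetical stand-in interface used only so the snippet compiles without Hadoop on the classpath. In a real job you would call progress() on the actual Mapper.Context (new API) or on the Reporter (old API) instead.

```java
// Sketch of the periodic-progress pattern. ProgressContext is a
// hypothetical stand-in for Hadoop's Mapper.Context / Reporter; in a
// real mapper, call context.progress() to reset the task timeout clock.
public class ProgressSketch {

    interface ProgressContext {
        void progress(); // in Hadoop, resets the 600 sec timeout clock
    }

    // Process numRecords records, reporting progress every `interval`
    // records rather than on every record, so reporting stays cheap.
    // Returns the number of progress calls made.
    static long processRecords(long numRecords, long interval, ProgressContext ctx) {
        long reports = 0;
        for (long i = 0; i < numRecords; i++) {
            // ... per-record work and context.write(key, value) would go here ...
            if (i % interval == 0) {
                ctx.progress();
                reports++;
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        long reports = processRecords(1_000_000, 10_000, () -> { /* no-op stand-in */ });
        System.out.println("progress calls: " + reports); // 100 for 1,000,000 records
    }
}
```

Reporting every N records rather than on every record keeps the overhead negligible while still proving liveness; note it will not help if a single context.write() itself blocks for more than the timeout.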
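[Note for readers of the archive] For completeness, the mapred.task.timeout property Tom points at can also be raised per job or cluster-wide - a stopgap rather than a fix if something in the write path is genuinely stuck. A sketch of the mapred-site.xml fragment, assuming the Hadoop version documented at the mapred-default.html link above:

```xml
<!-- mapred-site.xml: value is in milliseconds; default is 600000 (10 min).
     Setting it to 0 disables the timeout entirely, which hides hangs
     rather than explaining them. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value> <!-- 20 minutes -->
</property>
```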
