Thanks again. I'll take a look at Mapper.run to understand this better. Actually, just a few minutes ago I got AvroMapper working (it reads directly from Avro files). Hopefully this will improve performance even more.
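
In case it's useful to anyone else, here's roughly the shape of what I got working. This is a minimal sketch, not our actual job: it assumes Avro string records, and the class name, word-splitting logic, and output types are placeholders (job setup via AvroJob is omitted):

import java.io.IOException;

import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.Reporter;

// Sketch: reads Avro string records and emits (word, 1) pairs.
public class MyAvroMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {

  @Override
  public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
                  Reporter reporter) throws IOException {
    for (String word : line.toString().split("\\s+")) {
      collector.collect(new Pair<Utf8, Long>(new Utf8(word), 1L));
    }
  }

  @Override
  public void close() throws IOException {
    // Runs once after the last map() call - analogous to Mapper's cleanup.
  }
}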
Interesting, AvroMapper doesn't extend from Mapper, so it doesn't have the 'cleanup' method. Instead it provides a 'close' method, which seems to behave the same way. Honestly, I like the method name 'close' better than 'cleanup'. Doug - is there a reason you chose not to extend from org/apache/hadoop/mapreduce/Mapper?

Thank you all for your help.

On Fri, Nov 18, 2011 at 7:44 AM, Harsh J <ha...@cloudera.com> wrote:
> Given that you are sure about it, and you also know why that's the
> case, I'd definitely write inside the cleanup(…) hook. No harm at all
> in doing that.
>
> Take a look at the mapreduce.Mapper#run(…) method in the source and you'll
> understand what I mean by it not being a stage or even an event, but
> just a tail call after all map()s are called.
>
> On Fri, Nov 18, 2011 at 8:58 PM, Something Something
> <mailinglist...@gmail.com> wrote:
> > Thanks again for the clarification. Not sure what you mean by it's not a
> > 'stage'! Okay, maybe not a stage, but I think of it as an 'Event', such as
> > 'Mouseover' or 'Mouseout'. The 'cleanup' is really a 'MapperCompleted'
> > event, right?
> >
> > The confusion comes from the name of this method. The name 'cleanup' makes
> > me think it should not really be used as 'mapperCompleted', but it appears
> > there's no harm in using it that way.
> >
> > Here's our dilemma - when we use (local) caching in the Mapper & write in
> > the 'cleanup', our job completes in 18 minutes. When we don't write in
> > 'cleanup' it takes 3 hours!!! Knowing this, if you were to decide, would
> > you use 'cleanup' for this purpose?
> >
> > Thanks once again for your advice.
> >
> >
> > On Thu, Nov 17, 2011 at 9:35 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Hello,
> >>
> >> On Fri, Nov 18, 2011 at 10:44 AM, Something Something
> >> <mailinglist...@gmail.com> wrote:
> >> > Thanks for the reply. Here's another concern we have. Let's say a
> >> > Mapper has finished processing 1000 lines from the input file & then
> >> > the machine goes down. I believe Hadoop is smart enough to
> >> > re-distribute the input split that was assigned to this Mapper,
> >> > correct? After re-assigning, will it reprocess the 1000 lines that
> >> > were processed successfully before & start from line 1001, OR would
> >> > it reprocess ALL lines?
> >>
> >> Attempts of any task start afresh. That's the default nature of Hadoop.
> >>
> >> So it would begin from the start again and hence reprocess ALL lines.
> >> Understand that cleanup is just a fancy API call here that's called
> >> after the input reader completes - not a "stage".
> >>
> >> --
> >> Harsh J
>
>
>
> --
> Harsh J
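
P.S. For anyone who lands on this thread later - the run() method Harsh points at looks roughly like this in the mapreduce.Mapper source (simplified; check your Hadoop version):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context); // just a tail call after the input is exhausted - not a stage
}

And here's the local-caching pattern we're using, in sketch form (the types and the per-key counting are simplified stand-ins for our actual job):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Aggregates per-key counts in memory, then writes once per distinct
// key in cleanup() instead of once per input record.
public class CachingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    String k = value.toString();
    Long current = counts.get(k);
    counts.put(k, current == null ? 1L : current + 1L);
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Called once after the last map() call. If the attempt fails, the
    // whole split is reprocessed and the cache is rebuilt from scratch,
    // so writing here stays correct.
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}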