Thanks for the reply. Here's another concern we have. Let's say a Mapper has finished processing 1000 lines from its input file and then the machine goes down. I believe Hadoop is smart enough to re-assign the input split that was given to this Mapper to another node, correct? After re-assignment, will it skip the 1000 lines that were processed successfully before and start from line 1001, OR will it reprocess ALL lines?
On Wed, Nov 16, 2011 at 9:42 PM, Harsh J <ha...@cloudera.com> wrote:

> I'm sure you understand all implications here so I'll just answer your
> questions, inline.
>
> On Thu, Nov 17, 2011 at 9:53 AM, Something Something
> <mailinglist...@gmail.com> wrote:
> > Is the idea of writing business logic in the cleanup method of a Mapper
> > good or bad? We think we can make our Mapper run faster if we keep
> > accumulating data in a HashMap in a Mapper, and later write it in the
> > cleanup() method.
>
> You can certainly write it during the cleanup() call. Streams are only
> closed after that's done, so no issues framework-wise.
>
> > 1) Does the Map/Reduce paradigm guarantee that cleanup will always be
> > called before the reducer starts?
>
> Reducers start reducing only after all Map Tasks have completed
> (Tasks, on the whole level). So, yes. This is guaranteed.
>
> > 2) Is cleanup strictly for cleaning up unneeded resources?
>
> Yes, it was provided for that purpose.
>
> > 3) We understand that the HashMap can grow & that could cause memory
> > issues, but hypothetically let's say the memory requirements
> > were manageable.
>
> You are also pushing the whole write load to after the reads. It is
> almost 1:1 otherwise.
>
> P.S. Perhaps try overriding Mapper#run if you'd like complete control
> over how a Mapper executes in stages.
>
> --
> Harsh J
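
For reference, here is roughly the pattern we had in mind. This is only a sketch, not our actual code; the class name, field names, and the word-count logic are made up for illustration:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Accumulate in a HashMap during map(), write everything in cleanup().
    public class AggregatingMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      private final Map<String, Long> counts = new HashMap<String, Long>();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Accumulate in memory instead of writing one record per input line.
        for (String word : value.toString().split("\\s+")) {
          Long current = counts.get(word);
          counts.put(word, current == null ? 1L : current + 1L);
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        // All writes happen here, after the last call to map().
        // The output stream is still open at this point, as Harsh noted.
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
          context.write(new Text(entry.getKey()),
                        new LongWritable(entry.getValue()));
        }
      }
    }

And if we went the Mapper#run route that was suggested, presumably it would look something like the default run() loop, with the point at which writing happens under our control. Again, just a sketch:

    @Override
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      try {
        // Same loop as the default run(), but we decide exactly when
        // accumulation stops and writing begins.
        while (context.nextKeyValue()) {
          map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        // For example, the accumulated HashMap could be written here
        // instead of in cleanup().
      } finally {
        cleanup(context);
      }
    }
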