Thanks for the reply.  Here's another concern we have.  Let's say a Mapper
has finished processing 1000 lines from the input file & then the machine
goes down.  I believe Hadoop is smart enough to re-distribute the input
split that was assigned to this Mapper, correct?  After re-assigning, will
it skip the 1000 lines that were processed successfully before & start from
line 1001, OR would it reprocess ALL the lines?
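
For context, this is roughly the pattern we're considering (just a rough
sketch -- the class name, key/value types, and counting logic below are
illustrative, not our actual code): accumulate results in an in-memory
HashMap during map() and only write them out in cleanup().

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class AggregatingMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      // Accumulate in memory instead of emitting one record per input line.
      String k = value.toString();
      Long current = counts.get(k);
      counts.put(k, current == null ? 1L : current + 1L);
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      // All writes happen here, after the last input record has been read.
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
      }
    }
  }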



On Wed, Nov 16, 2011 at 9:42 PM, Harsh J <ha...@cloudera.com> wrote:

> I'm sure you understand all the implications here, so I'll just answer
> your questions inline.
>
> On Thu, Nov 17, 2011 at 9:53 AM, Something Something
> <mailinglist...@gmail.com> wrote:
> > Is the idea of writing business logic in the cleanup method of a Mapper
> > good or bad?  We think we can make our Mapper run faster if we keep
> > accumulating data in a HashMap in the Mapper, and later write it out in
> > the cleanup() method.
>
> You can certainly write it during the cleanup() call. The output streams
> are only closed after that's done, so there are no issues framework-wise.
>
> > 1)  Does the Map/Reduce paradigm guarantee that cleanup will always be
> > called before the reducer starts?
>
> Reducers start reducing only after all Map Tasks have completed (at the
> level of whole tasks, not individual records). So, yes, this is guaranteed.
>
> > 2)  Is cleanup strictly for cleaning up unneeded resources?
>
> Yes, it was provided for that purpose.
>
> > 3)  We understand that the HashMap can grow & that could cause memory
> > issues, but hypothetically let's say the memory requirements
> > were manageable.
>
> Note that you are also pushing the entire write load until after all the
> reads are done. Otherwise the reads and writes are interleaved roughly 1:1.
>
> P.S. Perhaps try overriding Mapper#run if you'd like complete control
> over how a Mapper executes its stages.
>
> --
> Harsh J
>
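
Regarding the Mapper#run suggestion: here's a minimal sketch of how we
understand it (the class name is ours, purely illustrative).  As far as we
can tell, the default run() is essentially setup(), then a loop over the
input records calling map(), then cleanup(), so overriding it gives full
control over those stages:

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class CustomRunMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    public void run(Context context)
        throws IOException, InterruptedException {
      setup(context);
      try {
        // Full control over the record loop: we could flush partial results
        // periodically, report progress, or stop early here if we wanted to.
        while (context.nextKeyValue()) {
          map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
      } finally {
        cleanup(context);
      }
    }
  }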
