It may be simpler just to have a post-processing step that uses something
like multi-file input to aggregate the results.
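
For example, a rough, untested post-processing sketch (the argument paths are
just placeholders) that copies every part file under the job output directory
into a single file on HDFS:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class MergeOutputs {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path outDir = new Path(args[0]);   // the job's output directory
        Path merged = new Path(args[1]);   // single destination file

        FSDataOutputStream out = fs.create(merged);
        for (FileStatus status : fs.listStatus(outDir)) {
          if (status.isDir()) continue;    // skip _logs and the like
          FSDataInputStream in = fs.open(status.getPath());
          IOUtils.copyBytes(in, out, conf, false); // keep 'out' open
          in.close();
        }
        out.close();
      }
    }

'hadoop fs -getmerge <outdir> <localfile>' does much the same thing if pulling
the merged result down to the local file system is acceptable.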

As a completely sideways-thinking solution: I suspect you have far more map
tasks than you have physical machines. Instead of writing your output via
output.collect, your tasks could open a 'side effect file' and append to it;
since these live in the local file system, you actually have the ability to
append to them. You will need to play some interesting games with the
OutputCommitter, though.
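
Very roughly, the map side could look like this (untested; the local directory
is made up, and it ignores the cleanup/commit problems entirely):

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SideEffectMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private BufferedWriter sideFile;

      public void configure(JobConf job) {
        try {
          // One local file per task attempt, named after mapred.task.id so
          // that attempts running on the same node don't clobber each other.
          String attempt = job.get("mapred.task.id");
          sideFile = new BufferedWriter(
              new FileWriter("/data/side-effects/" + attempt + ".txt", true));
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Append the record locally; nothing goes through sort/shuffle.
        sideFile.write(value.toString());
        sideFile.newLine();
      }

      public void close() throws IOException {
        sideFile.close();
      }
    }

Failed and speculative attempts leave files behind, which is exactly where the
OutputCommitter games come in.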

An alternative would be to write N output records, where N is the number of
reduces, each of the N keys is guaranteed to go to a unique reduce task, and
the value of each record is the local file name and the host name.
The side effect files would need to be written into the job working area or
some public area on the node, rather than the task output area, or the
output committer could place them in the proper place (that way failed tasks
are handled correctly).
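
The key-routing piece is just a trivial Partitioner. A sketch, assuming each
map emits an IntWritable bucket number in [0, N) as its key:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class BucketPartitioner implements Partitioner<IntWritable, Text> {
      public void configure(JobConf job) { }

      public int getPartition(IntWritable key, Text value, int numPartitions) {
        // The key already is the target reduce number; the modulo just keeps
        // it in range if the job is misconfigured.
        return key.get() % numPartitions;
      }
    }

The map can stash the OutputCollector handed to map() in a field and emit its
single (bucket, hostname + "\t" + local file name) record from close().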

The reduce then reads the keys it has, opens and concatenates whatever files
are on its machine, and very, very little sorting happens.
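
Sketched out (again untested, and assuming the values really are
"hostname<TAB>local path" strings as above), the reduce side would be
something like:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.InetAddress;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ConcatReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, Text, NullWritable> {

      public void reduce(IntWritable key, Iterator<Text> values,
                         OutputCollector<Text, NullWritable> output,
                         Reporter reporter) throws IOException {
        String localHost = InetAddress.getLocalHost().getHostName();
        while (values.hasNext()) {
          String[] hostAndPath = values.next().toString().split("\t", 2);
          if (!hostAndPath[0].equals(localHost)) {
            continue;  // file lives on another node; needs a different plan
          }
          // Copy the local side effect file straight into the reduce output.
          BufferedReader in =
              new BufferedReader(new FileReader(hostAndPath[1]));
          String line;
          while ((line = in.readLine()) != null) {
            output.collect(new Text(line), NullWritable.get());
          }
          in.close();
        }
      }
    }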


2009/4/28 Dmitry Pushkarev <u...@stanford.edu>

> Hi.
>
>
>
> I'm writing streaming-based tasks that involve running thousands of
> mappers; after that I want to put all these outputs into a small number
> (say 30) of output files, mainly so that disk space will be used more
> efficiently. The way I'm doing it right now is using /bin/cat as the
> reducer and setting the number of reducers to the desired value. This
> involves two highly ineffective (for the task) steps - sorting and
> fetching.  Is there a way to get around that?
>
> Ideally I'd want all mapper outputs to be written to one file, one record
> per line.
>
>
>
> Thanks.
>
>
>
> ---
>
> Dmitry Pushkarev
>
> +1-650-644-8988
>
>
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
