Seems like it wouldn't be more expensive than a few calls to the appropriate
Comparator to figure this out - the OutputCollector merely compares each
output key to the previously output key. If order is preserved, output this
extra truth when the "spill" to disk happens as a header field. If not, you
can stop calling the comparator as soon as output fails to be ordered a
single time. In any case, this means that sorts can be skipped on any output
sequences that are already sorted, and only applied to output sequences that
aren't.

On 1/25/07, Doug Judd <[EMAIL PROTECTED]> wrote:

Thinking about this a little more, the RDBMS join example is not such a
good
one, since you have to sort by foreign key anyway, and you can do this
sort
and merge as a single normal Map-reduce job.  However, there are cases
where
you know that the output of the map() phase is already "sorted".  You
should
be able to set a flag telling the mapred framework that this is the case
so
that in the Reduce phase, the files are simply merged together in one pass
instead of having to do two passes, first verifying that the input is
sorted
and a second pass to do the merge.

- Doug

On 1/25/07, Doug Judd <[EMAIL PROTECTED]> wrote:
>
> Comments inline ...
>
> On 1/25/07, Bryan A. P. Pendleton <[EMAIL PROTECTED]> wrote:
> >
> > If the output is already sorted, the sort pass *should* be able to run
> > in
> > linear time - perhaps not worth optimizing it out for cases of sorted
> > output.
>
>
> Agreed.  There's no reason why the framework can't detect this and
> automatically skip the n log(n) sort.
>
> Given the scatter/reassemble nature of the default map/reduce
> > (scatter by Partition, by default by the Hash), inputs that are sorted
> > may
> > not be written as such to output..... so, if you're counting on sorted
> > data,
> > maybe it's best to leave the sort in (and verify that the current
> > infrastructure will perform well given sorted input). Otherwise, if
> > there is
> > no implication/need of sorted output, then sort can be totally
disabled.
>
>
> I guess when I say "sorted" I really mean aggregated.  If you know the
> input is "aggregated" (e.g. it's the output of a previous Map/Reduce
job)
> and the map() function preserves this aggregation, then the step of
building
> the intermediate results from the map output should be done in linear
time.
>
> I do feel that this optimization is important.  As Adrezej points out,
> there are quite a few applications that could benefit from this.  In
> particular, the join example.  Without it, you can't really justify
> replacing your large RDBMS system with a Map-Reduce (Bigtable)
> implementation.
>
>  - Doug
>
>
>
>




--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp

Reply via email to