Agreed.  My thought was to just re-name the intermediate file with the
extension ".sorted" that way the files don't need a special metadata
section.  I'll add your comments to the JIRA enhancement request.

On 1/25/07, Bryan A. P. Pendleton <[EMAIL PROTECTED]> wrote:

Seems like it wouldn't be more expensive than a few calls to the
appropriate
Comparator to figure this out - the OutputCollector merely compares each
output key to the previously output key. If order is preserved, output
this
extra truth when the "spill" to disk happens as a header field. If not,
you
can stop calling the comparator as soon as output fails to be ordered a
single time. In any case, this means that sorts can be skipped on any
output
sequences that are already sorted, and only applied to output sequences
that
aren't.

On 1/25/07, Doug Judd <[EMAIL PROTECTED]> wrote:
>
> Thinking about this a little more, the RDBMS join example is not such a
> good
> one, since you have to sort by foreign key anyway, and you can do this
> sort
> and merge as a single normal Map-reduce job.  However, there are cases
> where
> you know that the output of the map() phase is already "sorted".  You
> should
> be able to set a flag telling the mapred framework that this is the case
> so
> that in the Reduce phase, the files are simply merged together in one
pass
> instead of having to do two passes, first verifying that the input is
> sorted
> and a second pass to do the merge.
>
> - Doug
>
> On 1/25/07, Doug Judd <[EMAIL PROTECTED]> wrote:
> >
> > Comments inline ...
> >
> > On 1/25/07, Bryan A. P. Pendleton <[EMAIL PROTECTED]> wrote:
> > >
> > > If the output is already sorted, the sort pass *should* be able to
run
> > > in
> > > linear time - perhaps not worth optimizing it out for cases of
sorted
> > > output.
> >
> >
> > Agreed.  There's no reason why the framework can't detect this and
> > automatically skip the n log(n) sort.
> >
> > Given the scatter/reassemble nature of the default map/reduce
> > > (scatter by Partition, by default by the Hash), inputs that are
sorted
> > > may
> > > not be written as such to output..... so, if you're counting on
sorted
> > > data,
> > > maybe it's best to leave the sort in (and verify that the current
> > > infrastructure will perform well given sorted input). Otherwise, if
> > > there is
> > > no implication/need of sorted output, then sort can be totally
> disabled.
> >
> >
> > I guess when I say "sorted" I really mean aggregated.  If you know the
> > input is "aggregated" (e.g. it's the output of a previous Map/Reduce
> job)
> > and the map() function preserves this aggregation, then the step of
> building
> > the intermediate results from the map output should be done in linear
> time.
> >
> > I do feel that this optimization is important.  As Adrezej points out,
> > there are quite a few applications that could benefit from this.  In
> > particular, the join example.  Without it, you can't really justify
> > replacing your large RDBMS system with a Map-Reduce (Bigtable)
> > implementation.
> >
> >  - Doug
> >
> >
> >
> >
>
>


--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp


Reply via email to