Agreed. My thought was to just re-name the intermediate file with the extension ".sorted" that way the files don't need a special metadata section. I'll add your comments to the JIRA enhancement request.
On 1/25/07, Bryan A. P. Pendleton <[EMAIL PROTECTED]> wrote:
Seems like it wouldn't be more expensive than a few calls to the appropriate Comparator to figure this out - the OutputCollector merely compares each output key to the previously output key. If order is preserved, output this extra truth when the "spill" to disk happens as a header field. If not, you can stop calling the comparator as soon as output fails to be ordered a single time. In any case, this means that sorts can be skipped on any output sequences that are already sorted, and only applied to output sequences that aren't. On 1/25/07, Doug Judd <[EMAIL PROTECTED]> wrote: > > Thinking about this a little more, the RDBMS join example is not such a > good > one, since you have to sort by foreign key anyway, and you can do this > sort > and merge as a single normal Map-reduce job. However, there are cases > where > you know that the output of the map() phase is already "sorted". You > should > be able to set a flag telling the mapred framework that this is the case > so > that in the Reduce phase, the files are simply merged together in one pass > instead of having to do two passes, first verifying that the input is > sorted > and a second pass to do the merge. > > - Doug > > On 1/25/07, Doug Judd <[EMAIL PROTECTED]> wrote: > > > > Comments inline ... > > > > On 1/25/07, Bryan A. P. Pendleton <[EMAIL PROTECTED]> wrote: > > > > > > If the output is already sorted, the sort pass *should* be able to run > > > in > > > linear time - perhaps not worth optimizing it out for cases of sorted > > > output. > > > > > > Agreed. There's no reason why the framework can't detect this and > > automatically skip the n log(n) sort. > > > > Given the scatter/reassemble nature of the default map/reduce > > > (scatter by Partition, by default by the Hash), inputs that are sorted > > > may > > > not be written as such to output..... so, if you're counting on sorted > > > data, > > > maybe it's best to leave the sort in (and verify that the current > > > infrastructure will perform well given sorted input). Otherwise, if > > > there is > > > no implication/need of sorted output, then sort can be totally > disabled. > > > > > > I guess when I say "sorted" I really mean aggregated. If you know the > > input is "aggregated" (e.g. it's the output of a previous Map/Reduce > job) > > and the map() function preserves this aggregation, then the step of > building > > the intermediate results from the map output should be done in linear > time. > > > > I do feel that this optimization is important. As Adrezej points out, > > there are quite a few applications that could benefit from this. In > > particular, the join example. Without it, you can't really justify > > replacing your large RDBMS system with a Map-Reduce (Bigtable) > > implementation. > > > > - Doug > > > > > > > > > > -- Bryan A. P. Pendleton Ph: (877) geek-1-bp