Ideally, as the implementor of a machine learning library, you wouldn't want to think about how to execute joins most efficiently. In most cases that depends on the data anyway. You would want an optimizer, similar to the ones used in databases, that takes your MapReduce data flow and figures out the best way to execute it.

On 11.03.2013 21:16, Ted Dunning wrote:
> Kinda sorta..
>
> You can defeat most of the sort if you want to just hash things to buckets.
>
> On Mon, Mar 11, 2013 at 12:01 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Sort component adds log to
>> the asymptotic complexity, whereas it is clear that any streaming merge
>> algorithm just wouldn't need to do sort and capitalize on the structure we
>> already know. (sure, you can do it map-side with a specific streaming join
>> logic but that would not be pure MR but rather some map task acrobatics).
>>
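
To make the two options from the quoted thread concrete, here is a rough sketch in plain Python (not Hadoop or Mahout code). The names hash_join and merge_join are made up for illustration. The first hashes one input into buckets and needs no sort at all. The second streams over two inputs that are assumed to be sorted by key already, so it also avoids the extra log factor. Which one is cheaper depends on the data, and that is exactly the decision an optimizer would ideally make for you.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def hash_join(left, right):
    """Join (key, value) pairs by hashing the left input into buckets.

    No ordering is required, but the left input must fit in memory.
    """
    buckets = defaultdict(list)
    for key, value in left:
        buckets[key].append(value)
    for key, rvalue in right:
        for lvalue in buckets.get(key, ()):
            yield key, lvalue, rvalue


def merge_join(left_sorted, right_sorted):
    """Streaming join over two inputs that are ALREADY sorted by key.

    Only the values of the current left key are buffered.
    """
    lgroups = groupby(left_sorted, key=itemgetter(0))
    rgroups = groupby(right_sorted, key=itemgetter(0))
    lkey, lgrp = next(lgroups, (None, None))
    rkey, rgrp = next(rgroups, (None, None))
    while lgrp is not None and rgrp is not None:
        if lkey < rkey:
            lkey, lgrp = next(lgroups, (None, None))
        elif lkey > rkey:
            rkey, rgrp = next(rgroups, (None, None))
        else:
            # Same key on both sides: emit the cross product of the two groups.
            lvalues = [value for _, value in lgrp]
            for _, rvalue in rgrp:
                for lvalue in lvalues:
                    yield lkey, lvalue, rvalue
            lkey, lgrp = next(lgroups, (None, None))
            rkey, rgrp = next(rgroups, (None, None))


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
    print(list(hash_join(left, right)))
    print(list(merge_join(left, right)))
```

Both functions produce the same joined records for the same inputs. The difference is what each one assumes: hash_join assumes one side fits in memory, merge_join assumes the sort order is already there.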
