On Mon, Mar 11, 2013 at 1:24 PM, Sebastian Schelter <[email protected]> wrote:
> Ideally, as an implementor of a machine learning library, you wouldn't
> want to think about how to most efficiently execute joins. It's data
> dependent anyway in most cases. You would want to have an optimizer
> similar to the ones used in databases that takes your map reduce data
> flow and figures out the best way to execute it.

And that's exactly the case I was referring to when I called MR a "too low level API".

That's why I turned to Spark, at least in a cautious, investigative way. It promises a higher-level (Flume-like) API, it caches datasets in memory (which avoids restarts and excessive I/O in pipelines), and it can be combined with Bagel primitives on the same intermediate dataset (which, as far as I understand, is exactly what Ted described: sort-less redistribution to buckets). It is so much richer.

I understand that in the space of Mahout we will probably have to wait for the promise of hybrid APIs in YARN and other Hadoop-native stuff, but isn't that really what would solve iterative, structured, and interconnected stuff?

> On 11.03.2013 21:16, Ted Dunning wrote:
> > Kinda sorta..
> >
> > You can defeat most of the sort if you want to just hash things to
> > buckets.
> >
> > On Mon, Mar 11, 2013 at 12:01 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> Sort component adds log to the asymptotic complexity, whereas it is
> >> clear that any streaming merge algorithm just wouldn't need to do
> >> sort and could capitalize on the structure we already know. (Sure,
> >> you can do it map-side with a specific streaming join logic, but
> >> that would not be pure MR but rather some map task acrobatics.)
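To make the two sort-free options from the thread concrete, here is a minimal single-process sketch in plain Python. It is not Spark, Bagel, or Hadoop code; the function names `hash_bucket_join` and `streaming_merge_join` and the `num_buckets` parameter are made up for illustration. The first function hashes keys to buckets instead of sorting them, and the second assumes both inputs are already ordered by key and merges them in one pass.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def hash_bucket_join(left, right, num_buckets=4):
    """Inner-join two lists of (key, value) pairs without sorting either."""
    # Redistribute the left side into buckets by key hash. This plays the
    # role of a shuffle that skips the sort step.
    buckets = [defaultdict(list) for _ in range(num_buckets)]
    for key, value in left:
        buckets[hash(key) % num_buckets][key].append(value)

    # Stream the right side and probe only the bucket its key hashes to.
    joined = []
    for key, right_value in right:
        for left_value in buckets[hash(key) % num_buckets].get(key, []):
            joined.append((key, left_value, right_value))
    return joined


def streaming_merge_join(left_sorted, right_sorted):
    """Inner-join two inputs that are ALREADY sorted by key, in one pass."""
    left_groups = groupby(left_sorted, key=itemgetter(0))
    right_groups = groupby(right_sorted, key=itemgetter(0))
    left = next(left_groups, None)
    right = next(right_groups, None)
    while left is not None and right is not None:
        if left[0] < right[0]:
            left = next(left_groups, None)
        elif left[0] > right[0]:
            right = next(right_groups, None)
        else:
            # Same key on both sides: emit the cross product of the values.
            right_values = [value for _, value in right[1]]
            for _, left_value in left[1]:
                for right_value in right_values:
                    yield (left[0], left_value, right_value)
            left = next(left_groups, None)
            right = next(right_groups, None)


if __name__ == "__main__":
    a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    b = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
    print(sorted(hash_bucket_join(a, b)))
    print(sorted(streaming_merge_join(a, b)))
```

Neither function calls a sort: the hash variant does one pass over each input plus bucket lookups, and the merge variant leans entirely on the ordering the inputs already have, which is the structure the quoted message suggests capitalizing on.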
