I spent the last few months working on the Stratosphere system, which is developed by my group. It's a research prototype, but it has so many of the things that we would need.
It extends the MapReduce model. For joins, for example, there is a new operator called 'Match' that lets you apply your user code to the result of an equi-join. The nice thing is that the system automatically chooses an efficient execution strategy for the join. Having something like this production-ready would save us so much code, as a lot of our implementations consist of hand-coded joins.

On 11.03.2013 21:43, Dmitriy Lyubimov wrote:
> On Mon, Mar 11, 2013 at 1:24 PM, Sebastian Schelter <[email protected]> wrote:
>
>> Ideally, as implementor of a machine learning library wouldn't want to
>> think about how to most efficiently execute joins. It's data dependent
>> anyway in most cases. You would want to have an optimizer similar to the
>> ones used in databases that takes your map reduce data flow and figures
>> out the best way to execute it.
>>
>
> And that's exactly the case which i was referring to as MR being "too low
> level api".
>
> That's why i turned to spark, at least in a cautious investigative way,
> because of the promise to provide higher level API (flume-like) and being
> cached in memory (restart/excessive I/O in pipelines) and combining with
> Bagel primitives on the same intermediate dataset (which, as far as i
> understand, is exactly what Ted said, sort-less redistribution to buckets).
> It is so much richer.
>
> I understand that in the space of Mahout, we probably will have to wait the
> promise of hybrid apis in Yarn etc. hadoop native stuff, but isn't really
> what would solve iterative structured and interconnected stuff?
>
>>
>> On 11.03.2013 21:16, Ted Dunning wrote:
>>> Kinda sorta..
>>>
>>> You can defeat most of the sort if you want to just hash things to
>>> buckets.
>>>
>>> On Mon, Mar 11, 2013 at 12:01 PM, Dmitriy Lyubimov <[email protected]>
>>> wrote:
>>>
>>>> Sort component adds log to
>>>> the asymptotic complexity, whereas it is clear that any streaming merge
>>>> algorithm just wouldn't need to do sort and capitalize on the structure we
>>>> already know . (sure, you can do it map-side with a specific streaming join
>>>> logic but that would not be pure MR but rather some map task acrobatics).
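
To make the 'Match' idea from the top of this mail a bit more concrete, here is a toy Python sketch. It is not Stratosphere's API: `match`, `hash_join`, `merge_join` and the `inputs_sorted` flag are all made-up names for illustration, and the flag is only a crude stand-in for what a real optimizer would decide from data properties. The point is just that the user code sees joined pairs and never has to know whether the join was done by hashing to buckets or by a streaming merge over already-sorted inputs.

```python
from collections import defaultdict


def hash_join(left, right):
    """Equi-join by hashing the smaller input to buckets (no sort needed)."""
    swap = len(left) > len(right)
    build, probe = (right, left) if swap else (left, right)
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    for key, value in probe:
        for other in table.get(key, ()):
            # Always emit (key, left_value, right_value).
            yield (key, value, other) if swap else (key, other, value)


def merge_join(left, right):
    """Equi-join of two inputs that are already sorted by key, in one pass."""
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left record with that key against the whole run.
            while i < len(left) and left[i][0] == lkey:
                for _, rvalue in right[j:j_end]:
                    yield (lkey, left[i][1], rvalue)
                i += 1
            j = j_end


def match(left, right, user_fn, inputs_sorted=False):
    """Hypothetical 'Match': apply user_fn to every equi-joined pair.

    The strategy choice is hidden from the caller; here it is a simple
    flag (an assumption), not a cost-based optimizer.
    """
    join = merge_join if inputs_sorted else hash_join
    return [user_fn(key, lv, rv) for key, lv, rv in join(left, right)]


users = [(1, "alice"), (2, "bob")]
orders = [(1, 10.0), (1, 5.0), (3, 7.5)]
pairs = match(users, orders, lambda key, name, amount: (name, amount))
```

The same `user_fn` works unchanged with `inputs_sorted=True`, which is what I mean by not wanting to think about the join strategy as a library implementor.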
