Re: Discossuon Of ML environment/MR, Mahout

Sebastian Schelter Mon, 11 Mar 2013 13:54:49 -0700

I spent the last months working on the Stratosphere system, which is
developed by my group. It's a research prototype, but it's got so much
things that we would need.


It extends the MapReduce model, for joins, e.g. there is a new operator
called 'Match' which lets you apply your user code to the result of an
equi-join. The nice thing is that the system automatically chooses an
efficient execution strategy for the join. Having something like this
production ready would save us so much code, as a lot of our
implementations consist of hand-coded joins.

On 11.03.2013 21:43, Dmitriy Lyubimov wrote:
> On Mon, Mar 11, 2013 at 1:24 PM, Sebastian Schelter <[email protected]> wrote:
> 
>> Ideally, as implementor of a machine learning library wouldn't want to
>> think about how to most efficiently execute joins. It's data dependent
>> anyway in most cases. You would want to have an optimizer similar to the
>> ones used in databases that takes your map reduce data flow and figures
>> out the best way to execute it.
>>
> 
> And that's exactly the case which i was referring to as MR being "too low
> level api".
> 
> That's why i turned to spark, at least in a cautious investigative way,
> because of the promise to provide higher level API (flume-like) and being
> cached in memory (restart/excessive I/O in pipelines) and combining with
> Bagel primitives on the same intermediate dataset (which, as far as i
> understand, is exactly what Ted said, sort-less redistribution to buckets).
> It is so much richer.
> 
> I understand that in the space of Mahout, we probably will have to wait the
> promise of hybrid apis in Yarn etc. hadoop native stuff, but isn't really
> what would solve iterative structured and interconnected stuff?
> 
> 
>>
>> On 11.03.2013 21:16, Ted Dunning wrote:
>>> Kinda sorta..
>>>
>>> You can defeat most of the sort if you want to just hash things to
>> buckets.
>>>
>>> On Mon, Mar 11, 2013 at 12:01 PM, Dmitriy Lyubimov <[email protected]
>>> wrote:
>>>
>>>> Sort component adds log to
>>>> the asymptotic complexity, whereas it is clear that any streaming merge
>>>> algorithm just wouldn't need to do sort and capitalize on the structure
>> we
>>>> already know . (sure, you can do it map-side with a specific streaming
>> join
>>>> logic but that would not be pure MR but rather some map task
>> acrobatics).
>>>>
>>>
>>
>>
>

Re: Discossuon Of ML environment/MR, Mahout

Reply via email to