I think contributions of types (a) and (b) are expected, whereas contributions
of type (c) (i.e., those with nothing in common with Mahout) have to be
considered case by case. Some contributions may not currently share anything
with Mahout but may come to in the near future; I'd say we should try to
accommodate those. However, in this case, as it stands, this algorithm seems
to have nothing in common with what we do.

What we do right now is linear algebra abstractions; maybe some frames later,
and I also have some ideas for probabilistic inference abstractions in mind
for the future.

However, we don't plan to have any streaming abstractions and likely never
will. I don't see anything wrong with this going to MLlib; it will make
Spark-based ML even more attractive. As long as Andy makes his work available
for the common good, I couldn't care less which Spark-friendly package he
places it in.

I also don't see algorithm richness as a goal in itself. The primary goal of
our work is to make quick prototyping possible in algebraic and probabilistic
fitting, and possibly feature preparation. This, IMO, is more important than
being a rigid collection of things. Again, I can only point people to Julia's
blog as the manifesto of this philosophy.

Yes, we want to be practically useful, with some examples of end-to-end
pipelines, and for that we probably must package some common approaches; but
in the end, in production, I will likely end up using not their exact versions
but ones customized in some way, just as I don't end up using the exact MLlib
versions of algorithms.

Speaking of something tangible, I'd rather see our feature prep pipeline
standardized and abstracted from engines than see us acquire more methods
right now; that would validate a lot of what we do. If in a few months we were
able to put together an end-to-end demo starting with feature encoding, that
would be a big deal to me.



On Wed, Jun 18, 2014 at 1:29 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random
> forests
>
> > Also, we don't have any mappings for Spark Streaming -- so if your
> > implementation heavily relies on Spark Streaming, I think Spark itself is
> > the right place for it to be a part of.
>
> We are discouraging engine specific work? Even dismissing Spark Streaming
> as a whole?
>
> > As it stands, we don't have purely (c) methods, and indeed I believe these
> > methods may be totally engine-specific, in which case MLlib is possibly
> > one of the good homes for them.
>
> Adherence to a specific incarnation of an engine-neutral DSL has become a
> requirement for inclusion in Mahout? The current DSL cannot be extended? Or
> it can’t be extended in engine-specific ways? Or it can’t be extended with
> Spark Streaming? I would have thought all of these things desirable;
> otherwise we are limiting ourselves to a subset of what an engine can do, or
> a subset of the problems that the current DSL supports.
>
> I hope I’m misreading this, but it looks like we just discouraged a
> contributor from adding post-Hadoop code in an interesting area to Mahout?
>
>
