I think contributions of types (a) and (b) are expected, whereas contributions of type (c) (i.e., sharing no common parts with Mahout) have to be considered case by case. Some contributions may not currently share parts with Mahout but may come to do so in the near future; I'd say we should try to accommodate those. However, as it stands, this algorithm seems to have no common parts with what we do.
What we do right now is linear algebra abstractions; perhaps data frames later, and I also have some ideas about probabilistic inference abstractions for the future. However, we don't plan to have any streaming abstractions and likely never will. I don't see anything wrong with this going to MLlib; it will make Spark-based ML even more attractive. As long as Andy makes his work available for the common good, I couldn't care less which Spark-friendly package he places it in.

I also don't see algorithm richness as a goal in itself. The primary goal of our work is to make quick prototyping possible in algebraic and probabilistic fitting, and possibly feature preparation. That, IMO, is more important than being a rigid collection of things. Again, I can only point to Julia's blog as the manifesto of this philosophy. Yes, we want to be practically useful, with some examples of end-to-end pipelines, and for that we probably must package some common approaches; but in the end, in production, I am likely to end up not using the exact packaged versions but rather ones customized in some way, just as I don't end up using the exact MLlib versions of algorithms.

Speaking of something tangible, I'd rather see our feature prep pipeline standardized and abstracted from engines than acquire more methods right now; that would validate a lot of what we do. If in a few months we were able to put up an end-to-end demo starting with feature encoding, that would be a big deal to me.

On Wed, Jun 18, 2014 at 1:29 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random
> forests
>
> > Also, we don't have any mappings for Spark Streaming -- so if your
> > implementation heavily relies on Spark streaming, i think Spark itself is
> > the right place for it to be a part of.
>
> We are discouraging engine-specific work? Even dismissing Spark Streaming
> as a whole?
> > As it stands we don't have purely (c) methods and indeed i believe these
> > methods may be totally engine-specific in which case mllib is one of
> > possibly good homes for them.
>
> Adherence to a specific incarnation of an engine-neutral DSL has become a
> requirement for inclusion in Mahout? The current DSL cannot be extended? Or
> it can't be extended in engine-specific ways? Or it can't be extended with
> Spark Streaming? I would have thought all of these things desirable;
> otherwise we are limiting ourselves to a subset of what an engine can do,
> or a subset of the problems the current DSL supports.
>
> I hope I'm misreading this, but it looks like we just discouraged a
> contributor from adding post-Hadoop code in an interesting area to Mahout?