Re: MI clustering

Grant Ingersoll Fri, 20 Nov 2009 06:32:58 -0800

On Nov 20, 2009, at 9:12 AM, Benson Margulies wrote:

> Grant,
> 
> I don't mean to belabor this, but I hate to have this public record of us
> misunderstanding each other quite so relentlessly. So I'm going to try one
> more time to see if I can phrase my point of view in such a way that we will
> be better aligned, and if I fail (or if we are indeed really poorly aligned),
> then so be it.
> 
> This starts with a question of the mission of Mahout, TLP or not. If the
> mission of Mahout is to focus on algorithms that are expressed as map-reduce
> on Hadoop, then, honestly, I don't think this code belongs. I've studied
> this in depth (and done a weak implementation), Jethran's done two
> implementations, my friend and colleague Dr. Scott Miller has done a few,
> and none of us think that this algorithm is going to fit.


Makes sense.  I've always said Mahout is about (and I think others feel the 
same):
1. scalable Machine Learning - scale is open to interpretation.  Some 
algorithms simply do not work on M/R, as you point out, but we are still 
interested in making them as fast as possible.  This is well documented in our 
archives.
2.  ASL
3. Java (we have a Pig PLSI implementation that will likely be committed soon, 
for instance)

In the end, the only ones I feel strongly about are #1 and #2.  Others may feel 
differently.  Personally, I know my C++ skills are rusty, so I'm not a likely 
contributor at the moment, but that shouldn't preclude others.  I just want to 
see Mahout help people solve Machine Learning problems in a scalable, 
commercial friendly way.  I don't particularly care what language is used to 
achieve that assuming there are people to support it.


> 
> If in addition, the project wants to stick with Java programs, even more so.
> This particular algorithm is one in which none of us see a way to make
> map-reduce parallelism compensate for the fundamental limitations of Java
> floating point speed. There may be another way to cluster based on MI that
> can exploit map-reduce, but this isn't it. Once I get the code posted
> somewhere, I'll let you all know where, and you are welcome to argue at that
> point.
> 
> My net impression is that the Mahout team might want to incorporate code
> that is outside the map-reduce corral, but is complementary to the broad
> mission of NLP algorithms, but that the team isn't excited about doing so
> right now.


That is a fair statement.  In 6 mos. I could see there being an NLP subproject 
to the Mahout TLP and this would fit there as a standalone subproject, IMO.  I 
certainly would love to see that. 

> 
> Then comes the process issue. I will write at the outset that I was making
> an incoherent and pretty unreasonable proposal about committer status.
> Because Java Map-Reduce technology is not applicable, at the moment, to
> the things we are doing at our place of business, Jethran and I are not
> well-positioned
> to pass through the standard procedure for earning committer status on the
> project just now.

Maybe.  If you put up an initial patch and one of us committed it, then your 
patches on that would be how you would earn it.

> It is true that other Apache projects have adopted
> committers in nonstandard ways, but, upon reflection, I don't see that as a
> valid analogy to the situation at hand. If you are curious, I can fill you
> in off-line as to the amusing tale of how I became a committer on
> WS-COMMONS.

:-).  I can't speak to other projects.  I'm just basing it on how I've viewed 
things as working in Lucene and the ASF.


> 
> I confess that I'm puzzled about your comment about proxy commits. Committers
> commit other people's work from JIRAs constantly, so that can't be what you
> are talking about. If the problem is someone misrepresenting work as their
> own, then that wouldn't arise in this case. If I gave the impression that I
> planned to mislead someone I apologize, I didn't mean to. In any case, I
> think the issue is moot, since I will explore what seems reasonable to the
> labs with the labs PMC.
> 

My apologies.  I misunderstood your intent.  Not sure why I didn't give you the 
benefit of the doubt knowing you know how all of this stuff works.

So, how does this sound:
1. Go to labs for now
2. Keep an eye on us here and when we become a TLP, we'll reevaluate MI as a 
subproject replete w/ its own committers and PMC representation?  

Cheers,
Grant
