I personally am caught between a desire on one hand to be inclusive of everything, and a desire on the other hand to not make the project a collection of bits and bobs from all over, with some algorithms existing in C++, others in Java, some distributed, some not, some supported, some a one-time dump, etc. That really harms end users' ability to place what Mahout 'is' and how much to expect of it. Either people will be surprised that some new scratch code isn't bug-free, or they will assume that the mature bits of the code are probably just as rough when they may not be.
The latter wins out in my mind in this case -- it 'feels' like a different project at this point. Let me, however, revive my suggestion that Mahout include a 'sandbox' module of sorts to host anything at all. This neatly allows for incorporation of anything, in any state, without confusing users as to what should be expected of Mahout 'proper', which should meet a reasonably high bar come version 1.0.

On Sat, Oct 3, 2009 at 5:17 PM, Benson Margulies <[email protected]> wrote:
> Folks,
>
> I may be in a position to contribute a very slick implementation of the
> Brown, dePietro, etc. bigram mutual information word clustering scheme
> sometime soon. It is written in C++, and if there's any map-reduce, it's via
> OpenMP, not Hadoop :-).
>
> As an ASF member, if I'm facilitating getting something useful out as open
> source, I'd rather push it out at Apache.
>
> Any interest in stretching the Mahout tent out to accommodate it?
>
> I'm asking now because I'm starting a negotiation with the academic owner
> thereof, and it would be useful to know in advance if I have a tentative
> home for it at Apache as opposed to having to just dump it into SourceForge.
>
> You could take the attitude that it's part of Mahout as a challenge: can
> anyone out there come up with a practical variation in Java/Hadoop?
>
> --benson
>
