Couple of thoughts, some slightly bigger than this specific topic:
1. I'm not against C++, but it hasn't attracted a lot of attention
just yet here in Mahout, either. One thought is we could port it,
given a donated implementation as a reference. We can put it in a
sandbox as Sean suggested.
2. I've always envisioned Mahout as a TLP. For instance, I've talked
with the OpenNLP maintainers about donating it (and they seem
amenable, just need to find time) along with the Maxent
implementation. Under this vision, Mahout is a TLP chartered to
provide machine learning implementations and has multiple
subprojects. I could certainly see a subproject for C++
implementations. For instance, we could have:
   1. Core Java (common utilities, algorithms, etc.)
   2. Core C/C++ (ditto)
   3. OpenNLP (builds on Core Java, since OpenNLP's Maxent impl. would go
      to core) - machine learning targeted specifically towards text.
      Utilities for text processing currently in utils would likely move
      here so that the core can remain agnostic of input
   4. Taste/Recommendations - all things collab filtering/recommendations
   5. Other verticals that require core, scalable ML libraries
Point 2 (the TLP structure) is a longer-term vision, and we are not
there yet, but I think it builds a nice tent, addresses Sean's concerns
(I believe) about Mahout being one big monolithic library with a lack
of focus, and rounds out as a nice set of libraries that help real
people solve real problems.
-Grant
On Oct 3, 2009, at 1:17 PM, Benson Margulies wrote:
Folks,
I may be in a position to contribute a very slick implementation of the
Brown, Della Pietra, et al. bigram mutual information word clustering
scheme sometime soon. It is written in C++, and if there's any
map-reduce, it's via OpenMP, not Hadoop :-).
As an ASF member, if I'm facilitating getting something useful out as
open source, I'd rather push it out at Apache.
Any interest in stretching the Mahout tent out to accommodate it?
I'm asking now because I'm starting a negotiation with the academic
owner thereof, and it would be useful to know in advance if I have a
tentative home for it at Apache as opposed to having to just dump it
into SourceForge.
You could take the attitude that it's part of Mahout as a challenge:
can anyone out there come up with a practical variation in Java/Hadoop?
--benson