+1 to both of these developments. I'm very happy to see corporate involvement in Mahout and I think it will be very good for the project in the long run. For-profit priorities will certainly have an impact upon our future activities but this will lead to broader market acceptance and use.

On 4/4/12 7:18 PM, Jake Mannix wrote:
+1 to everything Ted said.

   As an added point, while we're on the subject of corporate involvement,
forks, and extensions of Mahout, now is as good a time as any to announce
that I (and my teammate Andy Schlaikjer) are maintaining an official
"Twitter fork" of Mahout (hosted and worked on entirely in the open on
GitHub: http://github.com/twitter/mahout ), which we'll be making patches
off of to submit back to Apache trunk on a periodic basis.

   You might well ask: why not just submit JIRA tickets and patches
directly, esp. because this twitter team has a committer?  The reasoning I
had was one of expedience and safety: there are modifications and
improvements which I have wanted available in our internal build (which
pulls from our corp maven repo), but which still haven't undergone solid
testing.

   I could apply patches to a particular trunk svn rev, and deploy that
internally (like lots of places have "hadoop-0.20.3+patch5" and we have
patched pig, etc), but a) I like being able to just commit to a git repo,
pull in changes, iterate, test, cut a release tag, push immediately into
maven for consumption by appropriate internal projects; and b) I wanted it
out in the open to keep myself honest: doing it internally would open the
possibility of accidentally mixing private and public code, and also, if I
get lazy and don't contribute the code back to trunk, anyone else is free
to generate a patch and do it themselves (cf. slowness of getting
HBase/HDFS fixes out of Facebook, historically).

   Right now, twitter's fork is primarily focused around LDA / topic
modeling work, but recently I've been also working on a nice little jruby
REPL wrapper.  Currently it only supports loading SequenceFiles of
dictionaries and Vectors into memory and running LDA inference and
introspecting on the models themselves.  It is invokable via
"$MAHOUT_HOME/bin/mahout console" if you have JRUBY_HOME defined.  That
console provides a WAY faster way to inspect models, vectors, etc, and in
fact would be a great place to launch jobs from, if we take the approach
mentioned recently of having the run() method of AbstractJob be async, and
return a handle on the current running state of the job.  Then you could
start up a console in screen, launch your job, and check in on it.
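
   As a rough illustration of the async idea above (this is not Mahout's
actual API): a minimal Java sketch in which a job is started in the
background and the caller gets back a handle it can poll or wait on.  The
names AsyncJobSketch, JobHandle, runAsync, and awaitExitCode are all
hypothetical, and a plain Supplier stands in for AbstractJob's run().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class AsyncJobSketch {

    /** Hypothetical handle on a running job; not part of Mahout. */
    public static final class JobHandle {
        private final CompletableFuture<Integer> result;

        JobHandle(CompletableFuture<Integer> result) {
            this.result = result;
        }

        /** True once the job has finished, whether normally or with an error. */
        public boolean isDone() {
            return result.isDone();
        }

        /** Blocks up to the given timeout and returns the job's exit code. */
        public int awaitExitCode(long timeout, TimeUnit unit) throws Exception {
            return result.get(timeout, unit);
        }
    }

    /** Starts the job on a background thread and returns immediately. */
    public static JobHandle runAsync(Supplier<Integer> job) {
        return new JobHandle(CompletableFuture.supplyAsync(job));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a long-running job: do some work, then return exit code 0.
        JobHandle handle = runAsync(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            return 0;
        });

        // The caller is free to do other things and check in on the job later.
        System.out.println("done yet? " + handle.isDone());
        System.out.println("exit code: " + handle.awaitExitCode(5, TimeUnit.SECONDS));
    }
}
```

   From a console session, the equivalent would be to keep the handle in a
variable and call isDone() on it whenever you want to check in on the job.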

   Not to threadjack, but if we're talking about forks, commercial
development and so forth, I thought now was as good a time as any to talk
about this!

   -jake

On Apr 4, 2012 2:36 PM, "Ted Dunning"<ted.dunn...@gmail.com>  wrote:

With this announcement, this group has a fork in the road facing us.

We can choose the Hadoop path of forcibly excluding anybody with a slightly
wrong commercial taint from discussions (I call this the "more GNU than
GNU" philosophy).

Or we can choose a real community-based approach that includes vendors
regardless of how they use the code that we freely give away via the Apache
Mahout project (I call this "the Apache way").

As you may guess from the way that I phrase these options, I would prefer
the second approach.

As such, I would like it if we could resolve as a group that we very much welcome
what Sean is doing as an augmentation rather than diminution of the major
role that he has played in Mahout so far.  More than that, I would like to
go on record saying that I, at least, am happy to have all kinds of
participation in Mahout.

Is this the consensus here?  I think it is important to bring this subject
up early and get a definitive consensus rather than let it drift.

On Wed, Apr 4, 2012 at 12:33 PM, Sean Owen<sro...@gmail.com>  wrote:
> Dear all -- I've long pro...
