+1 to everything Ted said. As an added point, while we're on the subject of corporate involvement, forks, and extensions of Mahout, now is as good a time as any to announce that my teammate Andy Schlaikjer and I are maintaining an official "Twitter fork" of Mahout (hosted and worked on entirely in the open on GitHub: http://github.com/twitter/mahout ), from which we'll be cutting patches to submit back to Apache trunk on a periodic basis.
You might well ask: why not just submit JIRA tickets and patches directly, especially since this Twitter team has a committer? My reasoning was one of expedience and safety: there are modifications and improvements which I have wanted available in our internal build (which pulls from our corp maven repo), but which still haven't undergone solid testing. I could apply patches to a particular trunk svn rev and deploy that internally (like lots of places have "hadoop-0.20.3+patch5", and we have a patched pig, etc.), but:

a) I like being able to just commit to a git repo, pull in changes, iterate, test, cut a release tag, and push immediately into maven for consumption by the appropriate internal projects; and

b) I wanted it out in the open to keep myself honest: doing it internally would open the possibility of accidentally mixing private and public code, and also, if I get lazy and don't contribute the code back to trunk, anyone else is free to generate a patch and do it themselves (cf. the slowness of getting HBase/HDFS fixes out of Facebook, historically).

Right now, Twitter's fork is primarily focused on LDA / topic modeling work, but recently I've also been working on a nice little jruby REPL wrapper. Currently it only supports loading SequenceFiles of dictionaries and Vectors into memory, running LDA inference, and introspecting on the models themselves. It's invokable via "$MAHOUT_HOME/bin/mahout console" if you have JRUBY_HOME defined.

That console provides a WAY faster way to inspect models, vectors, etc., and in fact would be a great place to launch jobs from, if we take the approach mentioned recently of having the run() method of AbstractJob be async and return a handle on the current running state of the job. Then you could start up a console in screen, launch your job, and check in on it.

Not to threadjack, but if we're talking about forks, commercial development and so forth, I thought now was as good a time as any to talk about this!
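
To make the async run() idea above a bit more concrete, here is a rough, self-contained Java sketch of the general shape. Everything in it is hypothetical: AsyncJobSketch, JobHandle, and runAsync are made-up names for illustration, not anything that exists in Mahout or in AbstractJob today, and a real version would presumably wrap the underlying Hadoop job rather than a plain Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class AsyncJobSketch {

    /** Hypothetical handle on a running job; not part of Mahout. */
    public static final class JobHandle {
        private final CompletableFuture<Integer> exitCode;

        JobHandle(CompletableFuture<Integer> exitCode) {
            this.exitCode = exitCode;
        }

        /** True once the job has finished, whether it succeeded or failed. */
        public boolean isDone() {
            return exitCode.isDone();
        }

        /** Blocks until the job finishes and returns its exit code. */
        public int waitForExit() {
            return exitCode.join();
        }
    }

    /** Starts the blocking job body on a background thread and returns at once. */
    public static JobHandle runAsync(Callable<Integer> blockingRun) {
        CompletableFuture<Integer> future = CompletableFuture.supplyAsync(() -> {
            try {
                return blockingRun.call();
            } catch (Exception e) {
                // Surface job failures to whoever later waits on the handle.
                throw new CompletionException(e);
            }
        });
        return new JobHandle(future);
    }

    public static void main(String[] args) {
        JobHandle handle = runAsync(() -> {
            Thread.sleep(200); // stand-in for a long-running job
            return 0;
        });
        System.out.println("done yet? " + handle.isDone());
        System.out.println("exit code: " + handle.waitForExit());
    }
}
```

From a console session, the point would be that the call launching the job returns immediately, and the handle can be polled with isDone() whenever you check back in.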

-jake

On Apr 4, 2012 2:36 PM, "Ted Dunning" <[email protected]> wrote:

With this announcement, this group has a fork in the road facing us.

We can choose the Hadoop path of forcibly excluding anybody with a slightly wrong commercial taint from discussions (I call this the "more GNU than GNU" philosophy).

Or we can choose a real community based approach that includes vendors regardless of how they use the code that we freely give away via the Apache Mahout project (I call this "the Apache way").

As you may guess from the way that I phrase these options, I would prefer the second approach. As such, I would like it if we could resolve as a group that we very much welcome what Sean is doing as an augmentation rather than diminution of the major role that he has played in Mahout so far. More than that, I would like to go on record saying that I, at least, am happy to have all kinds of participation in Mahout.

Is this the consensus here? I think it is important to bring this subject up early and get a definitive consensus rather than let it drift.

On Wed, Apr 4, 2012 at 12:33 PM, Sean Owen <[email protected]> wrote:

> Dear all -- I've long pro...
