On a related note, I wish I could share the data I have to see how these algorithms stack up against the ones we use for large-scale learning.
Are there other examples of large data sets people use? I know there's the Exxon one and possibly the one used in the Netflix Prize. There's also ImageNet, but it's academic large scale...

On Fri, Apr 20, 2012 at 7:10 AM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> +1 to both of these developments. I'm very happy to see corporate
> involvement in Mahout and I think it will be very good for the project in
> the long run. For-profit priorities will certainly have an impact upon our
> future activities but this will lead to broader market acceptance and use.
>
>
> On 4/4/12 7:18 PM, Jake Mannix wrote:
>
>> +1 to everything Ted said.
>>
>> As an added point, while we're on the subject of corporate involvement,
>> forks, and extensions of Mahout, now is as good a time as any to announce
>> that I (and my teammate Andy Schlaikjer) are maintaining an official
>> "Twitter fork" of Mahout (hosted and worked on entirely in the open on
>> GitHub: http://github.com/twitter/mahout), which we'll be making patches
>> off of to submit back to Apache trunk on a periodic basis.
>>
>> You might well ask: why not just submit JIRA tickets and patches
>> directly, esp. because this Twitter team has a committer? The reasoning I
>> had was one of expedience and safety: there are modifications and
>> improvements which I have wanted available in our internal build (which
>> pulls from our corp maven repo), but which still haven't undergone solid
>> testing.
>>
>> I could apply patches to a particular trunk svn rev, and deploy that
>> internally (like lots of places have "hadoop-0.20.3+patch5" and we have
>> patched pig, etc), but a) I like being able to just commit to a git repo,
>> pull in changes, iterate, test, cut a release tag, push immediately into
>> maven for consumption by appropriate internal projects; and b) I wanted it
>> out in the open to keep myself honest: doing it internally would open the
>> possibility of accidentally mixing private and public code, and also, if I
>> get lazy and don't contribute the code back to trunk, anyone else is free
>> to generate a patch and do it themselves (cf. slowness of getting
>> HBase/HDFS fixes out of Facebook, historically).
>>
>> Right now, Twitter's fork is primarily focused around LDA / topic
>> modeling work, but recently I've also been working on a nice little jruby
>> REPL wrapper. Currently it only supports loading SequenceFiles of
>> dictionaries and Vectors into memory, running LDA inference, and
>> introspecting on the models themselves. Invokable via
>> "$MAHOUT_HOME/bin/mahout console" if you have JRUBY_HOME defined. That
>> console provides a WAY faster way to inspect models, vectors, etc, and in
>> fact would be a great place to launch jobs from, if we take the approach
>> mentioned recently of having the run() method of AbstractJob be async, and
>> return a handle on the current running state of the job. Then you could
>> start up a console in screen, launch your job, and check in on it.
>>
>> Not to threadjack, but if we're talking about forks, commercial
>> development and so forth, I thought now was as good a time as any to talk
>> about this!
>>
>> -jake
>>
>> On Apr 4, 2012 2:36 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
>>
>> With this announcement, this group has a fork in the road facing us.
>>
>> We can choose the Hadoop path of forcibly excluding anybody with a slightly
>> wrong commercial taint from discussions (I call this the "more GNU than
>> GNU" philosophy).
>>
>> Or we can choose a real community based approach that includes vendors
>> regardless of how they use the code that we freely give away via the Apache
>> Mahout project (I call this "the Apache way").
>>
>> As you may guess from the way that I phrase these options, I would prefer
>> the second approach.
>>
>> As such, I would like it if we could resolve as a group that we very much
>> welcome what Sean is doing as an augmentation rather than diminution of the
>> major role that he has played in Mahout so far. More than that, I would like
>> to go on record saying that I, at least, am happy to have all kinds of
>> participation in Mahout.
>>
>> Is this the consensus here? I think it is important to bring this subject
>> up early and get a definitive consensus rather than let it drift.
>>
>> On Wed, Apr 4, 2012 at 12:33 PM, Sean Owen <sro...@gmail.com> wrote:
>>
>> Dear all -- I've long pro...
>>
>>
>

--
Yee Yang Li Hector <https://plus.google.com/106746796711269457249>
Professional Profile <http://www.linkedin.com/in/yeehector>
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)