Yahoo offers a ratings dataset with 700M data points [1], which I recently used. That's still academically large, but at least it's a lot more challenging than Netflix :)
[1] http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Best,
Sebastian

On 20.04.2012 18:05, Hector Yee wrote:
> On a related note, wish i could share the data i have to see how these
> algorithms stack up to the ones we use for large scale learning.
>
> Are there other examples of large data sets people use? I know there's the
> Exxon one and possibly the one used in the netflix prize.
>
> There's also image net but it's academic large scale...
>
> On Fri, Apr 20, 2012 at 7:10 AM, Jeff Eastman
> <[email protected]> wrote:
>
>> +1 to both of these developments. I'm very happy to see corporate
>> involvement in Mahout and I think it will be very good for the project in
>> the long run. For-profit priorities will certainly have an impact upon our
>> future activities but this will lead to broader market acceptance and use.
>>
>> On 4/4/12 7:18 PM, Jake Mannix wrote:
>>
>>> +1 to everything Ted said.
>>>
>>> As an added point, while we're on the subject of corporate involvement,
>>> forks, and extensions of Mahout, now is as good a time as any to announce
>>> that I (and my teammate Andy Schlaikjer) are maintaining a official
>>> "Twitter fork" of Mahout (hosted and worked on entirely in the open on
>>> GitHub: http://github.com/twitter/mahout), which we'll be making patches
>>> off of to submit back to Apache trunk on a periodic basis.
>>>
>>> You might well ask: why not just submit JIRA tickets and patches
>>> directly, esp. because this twitter team has a committer? The reasoning I
>>> had was one of expedience and safety: there are modifications and
>>> improvements which I have wanted available in our internal build (which
>>> pulls from our corp maven repo), but still haven't undergone solid
>>> testing.
>>>
>>> I could apply patches to a particular trunk svn rev, and deploy that
>>> internally (like lots of places have "hadoop-0.20.3+patch5" and we have
>>> patched pig, etc), but a) I like being able to just commit to a gitrepo,
>>> pull in changes, iterate, test, cut a release tag, push immediately into
>>> maven for consumption by appropriate internal projects; and b) I wanted it
>>> out in the open to keep myself honest: doing it internally would open the
>>> possibility of accidentally mixing private and public code, and also, if I
>>> get lazy and don't contribute the code back to trunk, anyone else is free
>>> to generate a patch and do it themselves (c.f. slowness of getting
>>> HBase/HDFS fixes out of Facebook, historically).
>>>
>>> Right now, twitter's fork is primarily focused around LDA / topic
>>> modeling work, but recently I've been also working on a nice little jruby
>>> REPL wrapper. Currently it only supports loading SequenceFiles of
>>> dictionaries and Vectors into memory and running LDA inference and
>>> introspecting on the models themself. Invokable via
>>> "$MAHOUT-HOME/bin/mahout console" if you have JRUBY-HOME defined. That
>>> console provides a WAY faster way to inspect models, vectors, etc, and in
>>> fact would be a great place to launch jobs from, if we take the approach
>>> mentioned recently of having the run() method of AbstractJob be async, and
>>> return a handle on the current running state of the job. Then you could
>>> start up a console in screen, launch your job, and check in on it.
>>>
>>> Not to threadjack, but if we're talking about forks, commercial
>>> development and so forth, I thought now was as good a time as any to talk
>>> about this!
>>>
>>> -jake
>>>
>>> On Apr 4, 2012 2:36 PM, "Ted Dunning" <[email protected]> wrote:
>>>
>>> With this announcement, this group has a fork in the road facing us.
>>>
>>> We can choose the Hadoop path of forcibly excluding anybody with a
>>> slightly wrong commercial taint from discussions (I call this the "more
>>> GNU than GNU" philosophy).
>>>
>>> Or we can choose a real community based approach that includes vendors
>>> regardless of how they use the code that we freely give away via the
>>> Apache Mahout project (I call this "the Apache way").
>>>
>>> As you may guess from the way that I phrase these options, I would prefer
>>> the second approach.
>>>
>>> As such, I like it if we could resolve as a group that we very much
>>> welcome what Sean is doing as an augmentation rather than diminution of
>>> the major role that he has played in Mahout so far. More than that, I
>>> would like to go on record saying that I, at least, am happy to have all
>>> kinds of participation in Mahout.
>>>
>>> Is this the consensus here? I think it is important to bring this subject
>>> up early and get a definitive consensus rather than let it drift.
>>>
>>> On Wed, Apr 4, 2012 at 12:33 PM, Sean Owen <[email protected]> wrote:
>>>
>>> Dear all -- I've long pro...
