Yahoo offers a ratings dataset with 700M data points [1], which I recently
used. That's still academically large, but at least it's a lot more
challenging than Netflix :)

[1] http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Best,
Sebastian

On 20.04.2012 18:05, Hector Yee wrote:
> On a related note, I wish I could share the data I have to see how these
> algorithms stack up to the ones we use for large scale learning.
> 
> Are there other examples of large data sets people use? I know there's the
> Exxon one and possibly the one used in the Netflix Prize.
> 
> There's also ImageNet, but it's academic large scale...
> 
> On Fri, Apr 20, 2012 at 7:10 AM, Jeff Eastman wrote:
> 
>> +1 to both of these developments. I'm very happy to see corporate
>> involvement in Mahout and I think it will be very good for the project in
>> the long run. For-profit priorities will certainly have an impact upon our
>> future activities but this will lead to broader market acceptance and use.
>>
>>
>> On 4/4/12 7:18 PM, Jake Mannix wrote:
>>
>>> +1 to everything Ted said.
>>>
>>>   As an added point, while we're on the subject of corporate involvement,
>>> forks, and extensions of Mahout, now is as good a time as any to announce
>>> that I (and my teammate Andy Schlaikjer) are maintaining an official
>>> "Twitter fork" of Mahout (hosted and worked on entirely in the open on
>>> GitHub: http://github.com/twitter/mahout), which we'll be making patches
>>> off of to submit back to Apache trunk on a periodic basis.
>>>
>>>   You might well ask: why not just submit JIRA tickets and patches
>>> directly, esp. because this Twitter team has a committer?  The reasoning I
>>> had was one of expedience and safety: there are modifications and
>>> improvements which I have wanted available in our internal build (which
>>> pulls from our corp maven repo), but which still haven't undergone solid
>>> testing.
>>>
>>>   I could apply patches to a particular trunk svn rev, and deploy that
>>> internally (like lots of places have "hadoop-0.20.3+patch5" and we have
>>> patched pig, etc), but a) I like being able to just commit to a git repo,
>>> pull in changes, iterate, test, cut a release tag, push immediately into
>>> maven for consumption by appropriate internal projects; and b) I wanted it
>>> out in the open to keep myself honest: doing it internally would open the
>>> possibility of accidentally mixing private and public code, and also, if I
>>> get lazy and don't contribute the code back to trunk, anyone else is free
>>> to generate a patch and do it themselves (cf. slowness of getting
>>> HBase/HDFS fixes out of Facebook, historically).
>>>
>>>   Right now, Twitter's fork is primarily focused around LDA / topic
>>> modeling work, but recently I've been also working on a nice little JRuby
>>> REPL wrapper.  Currently it only supports loading SequenceFiles of
>>> dictionaries and Vectors into memory and running LDA inference and
>>> introspecting on the models themselves.  Invokable via
>>> "$MAHOUT_HOME/bin/mahout console" if you have JRUBY_HOME defined.  That
>>> console provides a WAY faster way to inspect models, vectors, etc, and in
>>> fact would be a great place to launch jobs from, if we take the approach
>>> mentioned recently of having the run() method of AbstractJob be async, and
>>> return a handle on the current running state of the job.  Then you could
>>> start up a console in screen, launch your job, and check in on it.
>>>
>>>   Not to threadjack, but if we're talking about forks, commercial
>>> development and so forth, I thought now was as good a time as any to talk
>>> about this!
>>>
>>>   -jake
>>>
>>> On Apr 4, 2012 2:36 PM, "Ted Dunning" wrote:
>>>
>>> With this announcement, this group has a fork in the road facing us.
>>>
>>> We can choose the Hadoop path of forcibly excluding anybody with a slightly
>>> wrong commercial taint from discussions (I call this the "more GNU than
>>> GNU" philosophy).
>>>
>>> Or we can choose a real community-based approach that includes vendors
>>> regardless of how they use the code that we freely give away via the Apache
>>> Mahout project (I call this "the Apache way").
>>>
>>> As you may guess from the way that I phrase these options, I would prefer
>>> the second approach.
>>>
>>> As such, I would like it if we could resolve as a group that we very much
>>> welcome what Sean is doing as an augmentation rather than diminution of the
>>> major role that he has played in Mahout so far.  More than that, I would
>>> like to go on record saying that I, at least, am happy to have all kinds of
>>> participation in Mahout.
>>>
>>> Is this the consensus here?  I think it is important to bring this subject
>>> up early and get a definitive consensus rather than let it drift.
>>>
>>> On Wed, Apr 4, 2012 at 12:33 PM, Sean Owen wrote:
>>>
>>> Dear all -- I've long pro...
>>>
>>>
>>
> 
> 
