On a related note, I wish I could share the data I have to see how these
algorithms stack up against the ones we use for large-scale learning.

Are there other examples of large data sets people use? I know there's the
Exxon one and possibly the one used in the Netflix Prize.

There's also ImageNet, but it's only large scale by academic standards...

On Fri, Apr 20, 2012 at 7:10 AM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> +1 to both of these developments. I'm very happy to see corporate
> involvement in Mahout and I think it will be very good for the project in
> the long run. For-profit priorities will certainly have an impact upon our
> future activities but this will lead to broader market acceptance and use.
>
>
> On 4/4/12 7:18 PM, Jake Mannix wrote:
>
>> +1 to everything Ted said.
>>
>>   As an added point, while we're on the subject of corporate involvement,
>> forks, and extensions of Mahout, now is as good a time as any to announce
>> that my teammate Andy Schlaikjer and I are maintaining an official
>> "Twitter fork" of Mahout (hosted and worked on entirely in the open on
>> GitHub: http://github.com/twitter/mahout), which we'll be making patches
>> off of to submit back to Apache trunk on a periodic basis.
>>
>>   You might well ask: why not just submit JIRA tickets and patches
>> directly, esp. because this twitter team has a committer?  The reasoning I
>> had was one of expedience and safety: there are modifications and
>> improvements which I have wanted available in our internal build (which
>> pulls from our corp maven repo), but still haven't undergone solid
>> testing.
>>
>>   I could apply patches to a particular trunk svn rev, and deploy that
>> internally (like lots of places have "hadoop-0.20.3+patch5" and we have
>> patched pig, etc), but a) I like being able to just commit to a git repo,
>> pull in changes, iterate, test, cut a release tag, push immediately into
>> maven for consumption by appropriate internal projects; and b) I wanted it
>> out in the open to keep myself honest: doing it internally would open the
>> possibility of accidentally mixing private and public code, and also, if I
>> get lazy and don't contribute the code back to trunk, anyone else is free
>> to generate a patch and do it themselves (cf. slowness of getting
>> HBase/HDFS fixes out of Facebook, historically).
>>
>>   Right now, twitter's fork is primarily focused around LDA / topic
>> modeling work, but recently I've been also working on a nice little jruby
>> REPL wrapper.  Currently it only supports loading SequenceFiles of
>> dictionaries and Vectors into memory and running LDA inference and
>> introspecting on the models themselves.  Invokable via
>> "$MAHOUT_HOME/bin/mahout console" if you have JRUBY_HOME defined.  That
>> console provides a WAY faster way to inspect models, vectors, etc, and in
>> fact would be a great place to launch jobs from, if we take the approach
>> mentioned recently of having the run() method of AbstractJob be async, and
>> return a handle on the current running state of the job.  Then you could
>> start up a console in screen, launch your job, and check in on it.
>>
>>   Not to threadjack, but if we're talking about forks, commercial
>> development and so forth, I thought now was as good a time as any to talk
>> about this!
>>
>>   -jake
>>
>> On Apr 4, 2012 2:36 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
>>
>> With this announcement, this group has a fork in the road facing us.
>>
>> We can choose the Hadoop path of forcibly excluding anybody with a
>> slightly wrong commercial taint from discussions (I call this the "more
>> GNU than GNU" philosophy).
>>
>> Or we can choose a real community-based approach that includes vendors
>> regardless of how they use the code that we freely give away via the
>> Apache Mahout project (I call this "the Apache way").
>>
>> As you may guess from the way that I phrase these options, I would prefer
>> the second approach.
>>
>> As such, I would like it if we could resolve as a group that we very
>> much welcome what Sean is doing as an augmentation rather than a
>> diminution of the major
>> role that he has played in Mahout so far.  More than that, I would like to
>> go on record saying that I, at least, am happy to have all kinds of
>> participation in Mahout.
>>
>> Is this the consensus here?  I think it is important to bring this subject
>> up early and get a definitive consensus rather than let it drift.
>>
>> On Wed, Apr 4, 2012 at 12:33 PM, Sean Owen <sro...@gmail.com> wrote:
>>
>> Dear all -- I've long pro...
>>
>>
>


-- 
Yee Yang Li Hector <https://plus.google.com/106746796711269457249>
Professional Profile <http://www.linkedin.com/in/yeehector>
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
