Re: Mahout's future

Ted Dunning Sat, 05 Oct 2013 21:54:59 -0700

Yes.  Strictly speaking.  In the third sentence.  But not in the first two.
I can see how you could misread that.


I went through all 54 datasets on S3 and didn't see any off-hand that are
usable.  Several are interesting, but they need additional work to be a
useful test.



On Sat, Oct 5, 2013 at 8:06 PM, Andrew Musselman <[email protected]> wrote:

> > These are data sets.
>
>
> That's what you asked for but okay.
>
> > On Oct 5, 2013, at 7:08 PM, Ted Dunning <[email protected]> wrote:
> >
> > These are data sets.  Not sample data for testing.
> >
> > If you have good examples of how to use one or more of these data sets for
> > a realistic test case or demo, please speak up.
> >
> >
> > On Sat, Oct 5, 2013 at 6:46 PM, Andrew Musselman <[email protected]> wrote:
> >
> >> Amazon hosts some public data sets at
> >> http://aws.amazon.com/publicdatasets/ and http://aws.amazon.com/datasets
> >>
> >>> On Oct 5, 2013, at 1:11 PM, Ted Dunning <[email protected]> wrote:
> >>>
> >>> I was asked to answer an anonymous question about the future of Mahout on
> >>> Quora and thought I should share the answer here as well.
> >>>
> >>> That really depends on where the community of users wants to take Mahout.
> >>>
> >>> Some possibilities include:
> >>>
> >>> a) better classifiers.  Mahout's capabilities in this respect include Naive
> >>> Bayes, Random Forest and logistic regression trained via single threaded
> >>> stochastic gradient descent (SGD).  It would be good to have a high quality
> >>> parallel implementation of SGD and it would be good to have some kind of
> >>> deep learning as well.  The random forest could also use some work.
> >>>
> >>> b) faster horses.  I think that the sparse matrices can be made
> >>> significantly faster even considering the cost-based optimizer versions
> >>> that we already have.  The addition of JBLAS support for dense matrices
> >>> would also be interesting.
> >>>
> >>> c) better API interfaces.  The clustering interfaces are a bit of a
> >>> shambles in spite of the cool capabilities available with streaming k-means
> >>> and friends.
> >>>
> >>> d) better human interfaces.  It would be great to have products like
> >>> Dataiku drive Mahout capabilities.  Dataiku does a really great job of the
> >>> cleansing end of machine learning and Mahout really has not much in that
> >>> area.  It would also be nice to move forward with Dmitriy Lyubimov's work
> >>> on Scala bindings for Mahout.
> >>>
> >>> e) bigger community.  There are some closely related communities like the
> >>> folks working on Spark with MLI.  More cross fertilization would be very
> >>> cool.
> >>>
> >>> f) more data.  Getting sample data for testing is very hard.  Getting data
> >>> at scale is exceedingly hard.  If people could suggest a good, big and
> >>> freely available dataset, that would be awesome.
> >>>
> >>> None of these possibilities matter, however, if somebody doesn't do them.
> >>> So the question to each reader of this answer is "What would you like to
> >>> see and how can you help make that happen"?
> >>
>
