Yes. Strictly speaking. In the third sentence. But not in the first two. I can see how you could misread that.
I went through all 54 datasets on S3 and didn't see any off-hand that are usable. Several are interesting, but they need additional work to be a useful test.

On Sat, Oct 5, 2013 at 8:06 PM, Andrew Musselman <[email protected]> wrote:

> > These are data sets.
>
> That's what you asked for but okay.
>
> On Oct 5, 2013, at 7:08 PM, Ted Dunning <[email protected]> wrote:
>
> > These are data sets. Not sample data for testing.
> >
> > If you have good examples of how to use one or more of these data sets for
> > a realistic test case or demo, please speak up.
> >
> > On Sat, Oct 5, 2013 at 6:46 PM, Andrew Musselman <[email protected]> wrote:
> >
> >> Amazon hosts some public data sets at
> >> http://aws.amazon.com/publicdatasets/ and http://aws.amazon.com/datasets
> >>
> >>> On Oct 5, 2013, at 1:11 PM, Ted Dunning <[email protected]> wrote:
> >>>
> >>> I was asked to answer an anonymous question about the future of Mahout on
> >>> Quora and thought I should share the answer here as well.
> >>>
> >>> That really depends on where the community of users wants to take Mahout.
> >>>
> >>> Some possibilities include:
> >>>
> >>> a) better classifiers. Mahout's capabilities in this respect include Naive
> >>> Bayes, Random Forest and logistic regression trained via single threaded
> >>> stochastic gradient descent (SGD). It would be good to have a high quality
> >>> parallel implementation of SGD and it would be good to have some kind of
> >>> deep learning as well. The random forest could also use some work.
> >>>
> >>> b) faster horses. I think that the sparse matrices can be made
> >>> significantly faster even considering the cost-based optimizer versions
> >>> that we already have. The addition of JBLAS support for dense matrices
> >>> would also be interesting.
> >>>
> >>> c) better API interfaces. The clustering interfaces are a bit of a
> >>> shambles in spite of the cool capabilities available with streaming k-means
> >>> and friends.
> >>>
> >>> d) better human interfaces. It would be great to have products like
> >>> Dataiku drive Mahout capabilities. Dataiku does a really great job of the
> >>> cleansing end of machine learning and Mahout really has not much in that
> >>> area. It would also be nice to move forward with Dmitriy Lyubimov's work
> >>> on Scala bindings for Mahout.
> >>>
> >>> e) bigger community. There are some closely related communities like the
> >>> folks working on Spark with MLI. More cross fertilization would be very
> >>> cool.
> >>>
> >>> f) more data. Getting sample data for testing is very hard. Getting data
> >>> at scale is exceedingly hard. If people could suggest a good, big and
> >>> freely available dataset, that would be awesome.
> >>>
> >>> None of these possibilities matter, however, if somebody doesn't do them.
> >>> So the question to each reader of this answer is "What would you like to
> >>> see and how can you help make that happen"?
