I don't, but I can offer two alternatives:

- Just have the user download the data set. I don't think this is a big burden.
- Download the data set automatically.
These are free of legal and tarball-size problems.

On Tue, Feb 9, 2010 at 2:11 PM, Robin Anil <robin.a...@gmail.com> wrote:
> I feel a need to check in a set of text documents to Mahout, maybe 3-4
> categories of documents, 10 each.
> They can be used in clustering, classification, vectorizer and collocation
> testing, and even frequent pattern generation.
>
> And instead of doing artificial tests, each of these can use this set to
> test against a reference implementation written in the test class, like
> what kmeans does.
>
> Plus we will have a baseline against which we can see improvements in these
> algorithms. Any idea of some good (legally sound :)) dataset which we can
> use?
>
> The same idea can be extended to CF also.
>
>
> Robin
>