I don't, but I can offer two alternatives:

1. Just have the user download the data set. I don't think this is a big burden.
2. Download the data set automatically.

Neither approach raises legal or tarball-size problems, since the data set is never checked in.
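
For the automatic download, here is a minimal sketch of what a test's setup could call, assuming the data set is reachable at some URL and that caching it in a local directory is acceptable. The class name DatasetFetcher, the method fetchIfAbsent, and whatever source URI and cache path get passed in are all hypothetical; none of this is existing Mahout code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of Mahout): fetches a test data set once and caches it. */
public final class DatasetFetcher {

  private DatasetFetcher() {}

  /**
   * Copies the resource at {@code source} to {@code target} unless {@code target}
   * already exists, so repeated test runs do not download the data again.
   */
  public static Path fetchIfAbsent(URI source, Path target) throws IOException {
    if (Files.exists(target)) {
      return target; // already cached from an earlier run
    }
    // A non-existent target cannot be a filesystem root, so it always has a parent.
    Path parent = target.toAbsolutePath().getParent();
    Files.createDirectories(parent);

    // Write to a temporary file first so an interrupted transfer
    // never leaves a partial file at the target path.
    Path partial = Files.createTempFile(parent, "dataset", ".part");
    try (InputStream in = source.toURL().openStream()) {
      Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
      Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
    } finally {
      Files.deleteIfExists(partial);
    }
    return target;
  }
}
```

A test would call fetchIfAbsent from its setup method and read the documents from the returned path. Only the first run pays for the download, and the data itself stays out of the source tree and the release tarball.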

On Tue, Feb 9, 2010 at 2:11 PM, Robin Anil <robin.a...@gmail.com> wrote:
> I feel a need to check in a set of text documents to mahout. maybe 3-4
> categories of documents 10 each.
> can be used in clustering classification, vectorizer collocation testing and
> even frequent pattern generation
>
> And instead doing artificial tests each of it can use this to test against a
> reference implementation written in the testclass like what kmeans does.
>
> Plus we will have a baseline with which we can see improvements in these
> algorithms. Any idea of some good(legally sound :))  dataset which we can
> use?
>
> Same idea can be extended to CF also
>
>
> Robin
>
