I feel a need to check in a set of text documents to Mahout: maybe 3-4 categories of documents, 10 each. They can be used in clustering, classification, vectorizer and collocation testing, and even frequent pattern generation.
And instead of doing artificial tests, each of them can use this set to test against a reference implementation written in the test class, like what kmeans does. Plus, we will have a baseline against which we can see improvements in these algorithms. Any idea of some good (legally sound :)) dataset which we can use? The same idea can be extended to CF also.
Robin