I feel we need to check a set of text documents into Mahout: maybe 3-4
categories with 10 documents each.
They can be used for clustering, classification, vectorizer, and collocation
testing, and even frequent pattern generation.

And instead of doing artificial tests, each of them can use this set to test
against a reference implementation written in the test class, like what
kmeans does.

Plus we will have a baseline against which we can see improvements in these
algorithms. Any ideas for a good (legally sound :)) dataset which we can
use?

The same idea can be extended to CF also.


Robin
