Yeah, that sounds OK. Do we have the pure content without HTML?

Robin

On Tue, Feb 9, 2010 at 8:24 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> Sure, how about a bunch of Apache project websites? The project name is
> the "category", i.e. Lucene, Tomcat, Hadoop, etc.
>
>
> On Feb 9, 2010, at 9:11 AM, Robin Anil wrote:
>
> > I feel a need to check in a set of text documents to mahout. maybe 3-4
> > categories of documents 10 each.
> > can be used in clustering classification, vectorizer collocation testing and
> > even frequent pattern generation
> >
> > And instead doing artificial tests each of it can use this to test against a
> > reference implementation written in the testclass like what kmeans does.
> >
> > Plus we will have a baseline with which we can see improvements in these
> > algorithms. Any idea of some good(legally sound :)) dataset which we can
> > use?
> >
> > Same idea can be extended to CF also
> >
> >
> > Robin
> >