Yeah, that sounds OK. Do we have the pure content without HTML?

Robin

On Tue, Feb 9, 2010 at 8:24 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> Sure, how about a bunch of Apache project websites?  The project name is
> the "category", i.e. Lucene, Tomcat, Hadoop, etc.
>
>
> On Feb 9, 2010, at 9:11 AM, Robin Anil wrote:
>
> > I feel a need to check in a set of text documents to Mahout, maybe 3-4
> > categories of 10 documents each. They can be used in clustering,
> > classification, vectorizer and collocation testing, and even frequent
> > pattern generation.
> >
> > And instead of doing artificial tests, each of them can use this to test
> > against a reference implementation written in the test class, like what
> > kmeans does.
> >
> > Plus we will have a baseline against which we can see improvements in
> > these algorithms. Any ideas for a good (legally sound :)) dataset that we
> > can use?
> >
> > The same idea can be extended to CF also.
> >
> >
> > Robin
>
>
>
