Re: potential test data for mahout

Gérard Dupont Thu, 19 Nov 2009 04:41:45 -0800

Hi,

I'm a bit out of the discussion and don't know the exact scope of
the tests needed. However, I still have the IOI crawl pages and Lucene
indexes that were offered after the end of the Search Wikia project.
The data is entirely unclassified but quite large (I have something like 30M
pages in mind). Would such data be of any use to you? Again, it's a raw crawl;
no classification has been applied.


cheers

On Thu, Nov 19, 2009 at 13:02, Grant Ingersoll <[email protected]> wrote:

> Very cool, I've added these to our collections wiki:
> http://cwiki.apache.org/confluence/display/MAHOUT/Collections
>
> On Nov 19, 2009, at 3:31 AM, Robert Muir wrote:
>
> > Hello,
> >
> > While doing some work for the open relevance project, I thought that a large
> > corpus of categorized documents might be useful test data for mahout.
> >
> > Here is one I am working with:
> > http://ece.ut.ac.ir/DBRG/Hamshahri/ (approximately 160k categorized docs)
> > There is a newer beta version here:
> > http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/ (approximately 320k categorized docs)
> >
> > --
> > Robert Muir
> > [email protected]
>
>


-- 
Gérard Dupont
Information Processing Control and Cognition (IPCC) - EADS DS
http://forge.ow2.org/projects/weblab/

Document & Learning team - LITIS Laboratory
