Hi,

I'm a bit out of the discussion and don't know the exact scope of the tests needed. However, I still have the IOI crawl pages and Lucene indexes that were offered after the end of the search wikia project. The data is completely unclassified but quite large (I have something like 30M pages in mind). Would you have any use for such data? Again, it's a raw crawl; no classification has been applied.
cheers

On Thu, Nov 19, 2009 at 13:02, Grant Ingersoll <[email protected]> wrote:

> Very cool, I've added these to our collections wiki:
> http://cwiki.apache.org/confluence/display/MAHOUT/Collections
>
> On Nov 19, 2009, at 3:31 AM, Robert Muir wrote:
>
> > Hello,
> >
> > While doing some work for the open relevance project, I thought that a large
> > corpus of categorized documents might be useful test data for mahout.
> >
> > Here is one I am working with:
> > http://ece.ut.ac.ir/DBRG/Hamshahri/ (Approximately 160k categorized docs)
> > There is a newer beta version here:
> > http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/ (Approximately 320k categorized docs)
> >
> > --
> > Robert Muir
> > [email protected]

--
Gérard Dupont
Information Processing Control and Cognition (IPCC) - EADS DS
http://forge.ow2.org/projects/weblab/

Document & Learning team - LITIS Laboratory
