This discussion is probably better suited for the Open Relevance Project (http://lucene.apache.org/openrelevance). That being said, the primary problem we have is one of redistribution. If you can give us a pointer to the data, and we can be confident that it isn't going to change, that is probably the best approach. Personally, I'd love to see and use it.
-Grant

On Nov 19, 2009, at 7:40 AM, Gérard Dupont wrote:

> Hi,
>
> I'm a bit out of the discussion and don't know the exact scope of
> the test needed. However, I still have the IOI crawl pages and Lucene
> indexes which were offered after the end of the search wikia project.
> It is totally unclassified data but quite large (I have something like 30M
> pages in mind). Do you have any use for such data? Again, it's a raw crawl; no
> classification has been applied.
>
> cheers
>
> On Thu, Nov 19, 2009 at 13:02, Grant Ingersoll <[email protected]> wrote:
>
>> Very cool, I've added these to our collections wiki:
>> http://cwiki.apache.org/confluence/display/MAHOUT/Collections
>>
>> On Nov 19, 2009, at 3:31 AM, Robert Muir wrote:
>>
>>> Hello,
>>>
>>> While doing some work for the open relevance project, I thought that a large
>>> corpus of categorized documents might be useful test data for mahout.
>>>
>>> Here is one I am working with:
>>> http://ece.ut.ac.ir/DBRG/Hamshahri/ (Approximately 160k categorized docs)
>>> There is a newer beta version here:
>>> http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/ (Approximately 320k categorized docs)
>>>
>>> --
>>> Robert Muir
>>> [email protected]
>>
>
> --
> Gérard Dupont
> Information Processing Control and Cognition (IPCC) - EADS DS
> http://forge.ow2.org/projects/weblab/
>
> Document & Learning team - LITIS Laboratory

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
