Re: potential test data for mahout

Grant Ingersoll Thu, 19 Nov 2009 05:15:56 -0800

This discussion is probably better suited to the Open Relevance Project 
(http://lucene.apache.org/openrelevance).  That being said, the primary problem 
we have is one of redistribution.  If you can give us a pointer to the data and 
we can be sure that it isn't going to change, that is probably the best thing.  
Personally, I'd love to see/use it.


-Grant
On Nov 19, 2009, at 7:40 AM, Gérard Dupont wrote:

> Hi,
> 
> I'm a bit out of the discussion and don't know the exact scope of
> the tests needed. However, I still have the IOI crawl pages and Lucene
> indexes which were offered after the end of the search wikia project.
> The data is completely unclassified but quite large (I have something like 30M
> pages in mind). Do you have any use for such data? Again, it's a raw crawl; no
> classification has been applied.
> 
> cheers
> 
> On Thu, Nov 19, 2009 at 13:02, Grant Ingersoll <[email protected]> wrote:
> 
>> Very cool, I've added these to our collections wiki:
>> http://cwiki.apache.org/confluence/display/MAHOUT/Collections
>> 
>> On Nov 19, 2009, at 3:31 AM, Robert Muir wrote:
>> 
>>> Hello,
>>> 
>>> While doing some work for the open relevance project, I thought that a large
>>> corpus of categorized documents might be useful test data for mahout.
>>> 
>>> Here is one I am working with:
>>> http://ece.ut.ac.ir/DBRG/Hamshahri/ (approximately 160k categorized docs)
>>> There is a newer beta version here:
>>> http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/ (approximately 320k categorized docs)
>>> 
>>> --
>>> Robert Muir
>>> [email protected]
>> 
>> 
> 
> 
> -- 
> Gérard Dupont
> Information Processing Control and Cognition (IPCC) - EADS DS
> http://forge.ow2.org/projects/weblab/
> 
> Document & Learning team - LITIS Laboratory

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search
