Thanks, Dennis.

Are there any further details of this DB?

Paul




________________________________
From: Dennis Kubes <ku...@apache.org>
To: nutch-user@lucene.apache.org
Sent: Monday, 22 June, 2009 19:45:10
Subject: Re: adding pre-indexed DB's together

There is still the URL crawl DB, which had over 1 billion URLs at last count.  So 
it might be a good starting point for crawling the web.  At last count, though, 
it was 250G in size, so it is not downloadable unless you have a fast 
connection.  It is available to anyone who wants it, though.

Dennis

Otis Gospodnetic wrote:
> Paul,
> 
> There was talk of this in the past, at least between some other people here 
> and me, possibly "off-line".  Your best bet may be going to what's left of 
> Wikia Search and getting their old index.  But, you see, this is exactly the 
> problem - the index may be quite outdated by now.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: Paul Jones <paul_jone...@yahoo.co.uk>
>> To: nutch-user@lucene.apache.org
>> Sent: Sunday, June 21, 2009 7:17:21 PM
>> Subject: adding pre-indexed DB's together
>> 
>> Hi
>> 
>> A newbie to the world of Lucene, Nutch, and Mahout here. I spent all weekend 
>> on Mahout and am now looking at Nutch. So I have a question: it seems (after 
>> reading the archives) that a lot of people are using Nutch to index the web, 
>> whether for vertical searches or just the web as a whole. Rather than 
>> everyone starting again from scratch, and since very little (if any) "IP" 
>> would exist in the indexes, since nothing clever has been done to them except 
>> being processed by Nutch, would it not be possible to "share" all these 
>> indexes with each other, for example if someone has built an index of all 
>> blogs, or all car-related websites, or just indexed 100 million web pages at 
>> random? Maybe there is some technical reason I am missing.
>> 
>> Paul
> 



      
