Thanks, Dennis. Are there any further details of this DB?
Paul

________________________________
From: Dennis Kubes <ku...@apache.org>
To: nutch-user@lucene.apache.org
Sent: Monday, 22 June, 2009 19:45:10
Subject: Re: adding pre-indexed DB's together

There is still the URL crawl db, which had over 1 billion URLs at last count. So it might be a good starting point for crawling the web. At last count, though, it was 250G in size, so it is not downloadable unless you have a fast connection. It is available for anyone that wants it though.

Dennis

Otis Gospodnetic wrote:
> Paul,
>
> There was talk of this in the past, at least between some other people here
> and me, possibly "off-line". Your best bet may be going to what's left of
> Wikia Search and getting their old index. But, you see, this is exactly the
> problem - the index may be quite outdated by now.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> From: Paul Jones <paul_jone...@yahoo.co.uk>
>> To: nutch-user@lucene.apache.org
>> Sent: Sunday, June 21, 2009 7:17:21 PM
>> Subject: adding pre-indexed DB's together
>>
>> Hi
>>
>> A newbie to the world of Lucene, Nutch, Mahout; spent all weekend on
>> Mahout, and now looking at Nutch. So I have a question: it seems (after
>> reading the archives) that a lot of people are using Nutch to index the web,
>> whether for vertical searches, or just the web as a whole. Now rather than
>> everyone starting again from scratch, and since very little (if any) "IP"
>> would exist in the indexes, since nothing clever has been done to them except
>> being processed by Nutch, would it not be possible to "share" all these
>> indexes with each other, i.e. if someone has built an index of all blogs, or
>> all car related websites, or just indexed 100 million webpages at random.
>> Maybe there is some tech reason I am missing.
>>
>> Paul