Axel, how did this go? I'd love to know if you got to 1B.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Webmaster <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, October 7, 2008 1:13:29 AM
> Subject: Extensive web crawl
>
> Ok..
>
> So I want to index the web.. All of it..
>
> Any thoughts on how to automate this so I can just point the spider off on
> its merry way and have it return 20 billion pages?
>
> So far I've been injecting random portions of the DMOZ mixed with other URLs
> like directory.yahoo.com and wiki.org. I was hoping this would give me a good
> return with an unrestricted URL filter where MY.DOMAIN.COM was replaced with
> *.* -- perhaps this is my error, and that line should be left as is while the
> last line is changed to +. instead of -. ?
>
> Anyhow, after injecting 2000 URLs and a few of my own, I still only get back
> minimal results, in the range of 500 to 600k URLs.
>
> Right now I have a new crawl going with 1 million injected URLs from the
> DMOZ; I'm thinking this should return an index of at least 20 million
> pages.. No?
>
> Anyhow, I have more HD space on the way and would like to get the index up
> to 1 billion pages by the end of the week..
>
> Any examples of how to set up crawl-urlfilter.txt and regex-urlfilter.txt
> would be helpful..
>
> Thanks..
>
> Axel..
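
For anyone who finds this thread later: the +. versus -. question in the
quoted message is the usual stumbling block. The Nutch URL filter tries its
rules top to bottom and the first matching +/- prefix wins, so if the
MY.DOMAIN.COM accept line is removed but the final -. line stays, every URL
falls through to that "deny everything" rule and the crawl cannot expand. A
minimal sketch of an unrestricted crawl-urlfilter.txt, assuming the stock
Nutch 0.x file (the suffix list here is abbreviated for illustration):

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't parse
    -\.(gif|GIF|jpg|JPG|png|PNG|css|zip|exe|gz|rpm|tgz|mov|MOV|ppt|xls)$

    # skip URLs containing characters that usually mark queries or session ids
    -[?*!@=]

    # skip URLs with a slash-delimited segment repeating 3+ times (crawler traps)
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept everything else -- this replaces both the
    # +^http://([a-z0-9]*\.)*MY.DOMAIN.COM/ line and the final -. line
    +.

So yes: the domain line comes out, and the last line becomes +. rather
than -. The same change applies to regex-urlfilter.txt if you run the crawl
step by step instead of via the crawl command.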
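On automating the crawl itself: for a whole-web crawl of this size the usual
route (per the Nutch whole-web tutorial of that era) is the
inject/generate/fetch/updatedb loop rather than the one-shot bin/nutch crawl
command, since you can keep cycling until the disk fills. A rough sketch; the
paths, the loop count, and the -topN value are illustrative, not from this
thread:

    #!/bin/sh
    # seed the crawldb once with the injected URL lists (e.g. the DMOZ dump)
    bin/nutch inject crawl/crawldb urls/

    # each pass fetches the top-scoring unfetched URLs and feeds the
    # newly discovered links back into the crawldb for the next pass
    for i in 1 2 3 4 5; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
      segment=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch $segment
      bin/nutch updatedb crawl/crawldb $segment
    done

Each pass only discovers the outlinks of what it just fetched, so the URL
frontier grows pass by pass rather than all at once; that, plus the filter
issue above, is why a small seed list plateaus in the hundreds of thousands
of URLs.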
