Axel, how did this go? I'd love to know if you got to 1B.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Webmaster <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, October 7, 2008 1:13:29 AM
> Subject: Extensive web crawl
>
> Ok..
>
> So I want to index the web.. All of it..
>
> Any thoughts on how to automate this so I can just point the spider off on
> its merry way and have it return 20 billion pages?
>
> So far I've been injecting random portions of the DMOZ mixed with other URLs
> like directory.yahoo.com and wiki.org. I was hoping this would give me a good
> return with an unrestricted URL filter where MY.DOMAIN.COM was replaced with
> *.* -- perhaps this is my error, and that line should be left as is while the
> last line is changed to +. instead of -. ?
>
> Anyhow, after injecting 2000 URLs and a few of my own, I still only get back
> minimal results, in the range of 500 to 600k URLs.
>
> Right now I have a new crawl going with 1 million injected URLs from the
> DMOZ; I'm thinking this should return an index of at least 20 million
> pages.. No?
>
> Anyhow, I have more HD space on the way and would like to get the index up
> to 1 billion pages by the end of the week..
>
> Any examples of how to set up crawl-urlfilter.txt and regex-urlfilter.txt
> would be helpful..
>
> Thanks..
>
> Axel..
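
For anyone who finds this thread later: the +. versus -. question in the
quoted message is the usual stumbling block. The Nutch URL filter tries its
rules top to bottom and the first matching +/- prefix wins, so if the
MY.DOMAIN.COM accept line is removed but the final -. line stays, every URL
falls through to that "deny everything" rule and the crawl cannot expand. A
minimal sketch of an unrestricted crawl-urlfilter.txt, assuming the stock
Nutch 0.x file (the suffix list here is abbreviated for illustration):

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't parse
    -\.(gif|GIF|jpg|JPG|png|PNG|css|zip|exe|gz|rpm|tgz|mov|MOV|ppt|xls)$

    # skip URLs containing characters that usually mark queries or session ids
    -[?*!@=]

    # skip URLs with a slash-delimited segment repeating 3+ times (crawler traps)
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept everything else -- this replaces both the
    # +^http://([a-z0-9]*\.)*MY.DOMAIN.COM/ line and the final -. line
    +.

So yes: the domain line comes out, and the last line becomes +. rather
than -. The same change applies to regex-urlfilter.txt if you run the crawl
step by step instead of via the crawl command.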
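On automating the crawl itself: for a whole-web crawl of this size the usual
route (per the Nutch whole-web tutorial of that era) is the
inject/generate/fetch/updatedb loop rather than the one-shot bin/nutch crawl
command, since you can keep cycling until the disk fills. A rough sketch; the
paths, the loop count, and the -topN value are illustrative, not from this
thread:

    #!/bin/sh
    # seed the crawldb once with the injected URL lists (e.g. the DMOZ dump)
    bin/nutch inject crawl/crawldb urls/

    # each pass fetches the top-scoring unfetched URLs and feeds the
    # newly discovered links back into the crawldb for the next pass
    for i in 1 2 3 4 5; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
      segment=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch $segment
      bin/nutch updatedb crawl/crawldb $segment
    done

Each pass only discovers the outlinks of what it just fetched, so the URL
frontier grows pass by pass rather than all at once; that, plus the filter
issue above, is why a small seed list plateaus in the hundreds of thousands
of URLs.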
