OK, so I want to index the web. All of it.
Any thoughts on how to automate this so I can just point the spider off on its merry way and have it return 20 billion pages? So far I've been injecting random portions of the DMOZ mixed with other URLs like directory.yahoo.com and wiki.org. I was hoping this would give me a good return with an unrestricted URL filter, where MY.DOMAIN.COM was replaced with *.* -- perhaps that's my error, and the domain line should be left as is while the last line is changed from -. to +. instead?

Anyhow, after injecting 2000 URLs plus a few of my own, I still only get back minimal results, in the range of 500 to 600k URLs. Right now I have a new crawl going with 1 million URLs injected from the DMOZ; I'm thinking this should return a 20 million page index at least. No?

I have more HD space on the way and would like to get the index up to 1 billion pages by the end of the week. Any examples of how to set up url-filter.txt and regex-filter.txt would be helpful.

Thanks,
Axel
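In case it helps anyone answering, here is my best guess at what the unrestricted filter should look like, pieced together from the whole-web tutorial. The skip patterns are just the stock defaults from my install, so treat this as a sketch rather than gospel. As far as I can tell the filter lines are regexes, not globs, so *.* doesn't mean "any domain"; dropping the MY.DOMAIN.COM rule entirely and flipping the final -. to +. looks like the cleaner fix:

    # regex-filter.txt (regex-urlfilter.txt in some versions) -- unrestricted crawl
    # skip file:, ftp:, and mailto: URLs
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

    # skip URLs containing characters that are probably queries, session ids, etc.
    -[?*!@=]

    # skip URLs with a slash-delimited segment repeating 3+ times (crawler traps)
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept everything else -- this is the line I suspect should be +. instead of -.
    +.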

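And on the automation side, this is the inject/generate/fetch/update cycle I've been running to grow the db, assuming a 0.8-style layout (a crawldb plus a segments directory; adjust paths and -topN to taste). Repeating the last four steps is what actually expands the crawl beyond the injected seeds, since each updatedb folds newly discovered links back into the crawldb:

    # inject the DMOZ seed list into the crawl db
    bin/nutch inject crawl/crawldb dmoz/urls

    # repeat this cycle to grow the index: each pass fetches a new segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 100000
    s=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s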