Hello, I am trying to save as much memory, CPU, and bandwidth as possible, since I may have millions of URLs to crawl.
The crawl-urlfilter.txt in the conf/ directory has:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.

Assume my urls/seed file contains millions of URLs to fetch, crawl, generate, etc., and I don't want to go through millions of lines to review each one. Let's say I have these lines in my urls/seed file (the file will have millions of URLs):

http://apache.org
http://subdomain1.apache.org
http://subdomain2.apache.org
http://subdomain3.apache.org
http://subdomain4.apache.org
http://subdomain5.apache.org

Correct me if I am wrong, but if 'http://apache.org' also crawls subdomains because of '+^http://([a-z0-9]*\.)*apache.org/' in crawl-urlfilter.txt, then wouldn't the subdomains be crawled more than once if I also have subdomain lines in my urls/seed file? I don't mind one or two duplicates, but it would waste a lot of CPU, memory, and bandwidth as my URL list continues to grow.

If that is the issue, has anyone thought of a way to filter out all the subdomains in his/her urls/seed file? I'm trying to find a way (maybe there is a better method) to search each line for more than one dot ".". If there is no other way, does anyone know how to find which lines have more than one "." using unix commands (vi, awk, sed)?

Thank you very much
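
For reference, the "more than one dot" search asked about above could be sketched with awk roughly as follows. This is only a minimal sketch under stated assumptions: the seed file has one URL per line, the dots are counted in the host part only (so a path such as /index.html does not count), and list_subdomain_urls is a hypothetical helper name, not anything provided by Nutch. Note that a dot count is a rough heuristic: hosts such as www.apache.org or example.co.uk would also be reported.

```shell
#!/bin/sh
# Print seed URLs whose host part contains more than one dot
# (e.g. http://subdomain1.apache.org). Reads the named files, or
# standard input when no file is given, one URL per line.
# list_subdomain_urls is a hypothetical helper name, not part of Nutch.
list_subdomain_urls() {
    awk '{
        host = $0
        sub(/^[a-zA-Z]+:\/\//, "", host)   # drop the scheme, e.g. "http://"
        sub(/[\/?#].*$/, "", host)         # drop any path, query, or fragment
        n = gsub(/\./, ".", host)          # gsub returns the number of dots replaced
        if (n > 1) print $0                # more than one dot: treat as a subdomain
    }' "$@"
}
```

It could then be run as `list_subdomain_urls urls/seed` to list the candidate subdomain lines for review; changing the condition to `n <= 1` would instead print the lines to keep.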