Redundancy issue in crawling

Ken Ken Wed, 20 Jan 2010 14:05:18 -0800

Hello,

I am trying to save as much memory, CPU, and bandwidth as possible in case I 
have millions of URLs to crawl.



This is my crawl-urlfilter.txt in the conf/ directory:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
-.


Assume my urls/seed file contains millions of URLs to fetch, crawl, generate, 
etc., and I don't want to go through millions of lines to review each one.
Let's say I have these lines in my urls/seed file (the file will have millions 
of URLs):


http://apache.org
http://subdomain1.apache.org
http://subdomain2.apache.org
http://subdomain3.apache.org
http://subdomain4.apache.org
http://subdomain5.apache.org

Correct me if I am wrong, but if 'http://apache.org' also crawls subdomains 
because of '+^http://([a-z0-9]*\.)*apache.org/' in crawl-urlfilter.txt, then 
wouldn't the subdomains be crawled more than once if I also have subdomain 
lines in my urls/seed file?
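
To make the question concrete, here is a minimal sketch of which URLs the 
accept rule lets through. It assumes the pattern behaves the same under 
grep -E as under the crawler's own regex engine (it only uses syntax common to 
both), and it assumes the URLs are compared with a trailing slash. The function 
name and the example.com URL are made up for illustration.

```shell
#!/bin/sh
# The accept pattern from crawl-urlfilter.txt, without the leading "+".
# Assumption: grep -E (POSIX ERE) stands in for the crawler's regex engine.
pattern='^http://([a-z0-9]*\.)*apache.org/'

# Hypothetical helper: print the URLs from stdin that the rule would match.
accepted_by_filter() {
    grep -E "$pattern"
}

# Trailing slashes are written out explicitly (assumption: URLs are
# compared in this form). example.com is only here as a non-matching case.
printf '%s\n' \
    'http://apache.org/' \
    'http://subdomain1.apache.org/' \
    'http://example.com/' | accepted_by_filter
```

Both the bare domain and the subdomain pass the rule, while the unrelated host 
does not. This only shows what the filter accepts; it says nothing about 
whether a page would actually be fetched twice.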

I don't mind if I have one or two duplicates, but it would waste a lot of CPU, 
memory, and bandwidth as my urls/seed file continues to grow.

If that is the issue, has anyone thought of a way to filter out all the 
subdomains in his/her urls/seed file?

I'm trying to find a way (maybe there is a better method) to search each line 
for more than one dot ".". If there is no other way, does anyone know how to 
find which lines have more than one "." using Unix commands (vi, awk, sed)?
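
Something along these lines is what I have in mind: a rough sketch using awk 
that splits each line on a literal dot and keeps the lines with more than one. 
The function name and the sample file are placeholders, not anything from my 
real setup.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a file that contain more than
# one "." character. Splitting on a literal dot turns N dots into N+1
# fields, so NF > 2 means "two or more dots".
multi_dot_lines() {
    awk -F'.' 'NF > 2' "$1"
}

# Small sample seed file, for illustration only.
seed_file=$(mktemp)
printf '%s\n' \
    'http://apache.org' \
    'http://subdomain1.apache.org' \
    'http://subdomain2.apache.org' > "$seed_file"

# Prints the two subdomain lines and skips http://apache.org.
multi_dot_lines "$seed_file"
```

Counting dots is only a heuristic: a URL with a path such as /index.html, or a 
domain under a two-part suffix like .co.uk, would also have more than one dot 
without being a subdomain of something already in the list.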

Thank you very much
