I'm noticing that the crawl is picking up URLs for parked domains that serve only a default web page, as well as domains that are owned or controlled by registrars. It seems that many people are using parked domains as a source of revenue (http://www.google.com/domainpark/).

Does anyone have ideas on how to exclude these during the crawl process? I'm seeing that some parked domains redirect you to a URL with a predictable format. For example:

http://apps5.oingo.com/apps/domainpark/domainpark.cgi?client=JPET7259&s=dargo.net

Thanks,
James
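A minimal sketch of the redirect-pattern idea, in Python rather than as a Nutch plugin. It assumes the crawler can hand over the final URL after following redirects. The names `PARKING_HOSTS`, `PARKING_PATH`, `is_parking_redirect`, and `parked_domain` are hypothetical, and the only pattern included is the oingo.com one from the example above; a real list would have to be collected by hand.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical pattern list. Only the oingo.com host form is taken from the
# example redirect in this message; other parking services would need their
# own entries.
PARKING_HOSTS = [
    re.compile(r"^apps\d*\.oingo\.com$"),
]

# Assumption: the parking script lives under a "/domainpark/" path segment,
# as it does in the example redirect.
PARKING_PATH = re.compile(r"/domainpark/", re.IGNORECASE)


def is_parking_redirect(url):
    """Return True if the redirect target looks like a known parking page."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host_match = any(p.match(host) for p in PARKING_HOSTS)
    path_match = bool(PARKING_PATH.search(parts.path))
    return host_match and path_match


def parked_domain(url):
    """Return the 's' query parameter of a parking redirect, or None.

    Assumption: 's' carries the parked domain name, as it appears to in the
    example redirect.
    """
    if not is_parking_redirect(url):
        return None
    values = parse_qs(urlsplit(url).query).get("s")
    return values[0] if values else None


if __name__ == "__main__":
    # example.com is a placeholder, not a domain from the crawl.
    target = ("http://apps5.oingo.com/apps/domainpark/domainpark.cgi"
              "?client=JPET7259&s=example.com")
    print(is_parking_redirect(target))
    print(parked_domain(target))
```

A check like this only catches redirects whose format is already known, so it would miss parked domains that serve their default page directly without redirecting.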
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
