I'm noticing that URLs for parked domains with default web pages, and for
domains that are controlled or owned by registrars, are getting crawled. It
seems that many people are using parked domains as a source of revenue
(http://www.google.com/domainpark/).
 
Does anyone have ideas on how to exclude these during the crawl process?
 
I'm seeing that some parked domains will redirect you to a URL with a
predictable format. For example:
 
http://apps5.oingo.com/apps/domainpark/domainpark.cgi?client=JPET7259&s=dargo.net
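
To make the idea concrete, here is a rough sketch of how a redirect target in that format could be recognized. It is only an illustration and not part of Nutch: the names PARKING_URL_PATTERNS, is_parked_redirect and parked_domain are made up for this example, the single pattern is derived from the one URL above, and the guess that the "s" parameter carries the parked domain is an assumption based on that same URL.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical pattern list. Only this one entry comes from the example URL
# in this message; a real list would have to be collected and maintained.
PARKING_URL_PATTERNS = [
    re.compile(r"^https?://(?:[^/]*\.)?oingo\.com/apps/domainpark/", re.IGNORECASE),
]


def is_parked_redirect(url):
    """Return True if the URL matches a known parking-service format."""
    return any(pattern.search(url) for pattern in PARKING_URL_PATTERNS)


def parked_domain(url):
    """Return the value of the 's' query parameter, or None if absent.

    Assumption: in the example URL, 's' appears to name the parked domain.
    """
    values = parse_qs(urlsplit(url).query).get("s")
    return values[0] if values else None
```

A crawler could run a check like this on each redirect target and skip the originating host when it matches. The same regular expression might also be adaptable to a URL filter rule, though that would only catch the redirect target, not the parked domain itself.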
 
Thanks,
James


