2. Have a list of common hosts (geocities, tripod, etc.) which we do not treat politely.

Do folks think (2) is worth exploring?

It does sound reasonable to me, but it is obviously a hassle to maintain such a list. An alternative could be to detect such hosts automatically. Domains such as geocities or tripod are usually clusters of servers, so they can be crawled significantly faster. How do we detect such clusters? A potential hint could be the cumulative number of pages fetched from a single server -- clusters, and fast servers on high-speed connections in general, usually serve an order of magnitude more pages than regular servers.
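To make the idea concrete, here is a minimal sketch of that heuristic (not actual Nutch code; the class name and threshold are hypothetical): keep a per-host fetch counter and flag hosts whose cumulative count passes a threshold as likely clusters.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: flag hosts whose cumulative fetch count
 * exceeds a threshold as likely server clusters that could tolerate
 * a higher crawl rate.
 */
public class ClusterDetector {
    private final Map<String, Integer> pagesPerHost = new HashMap<>();
    private final int threshold;

    public ClusterDetector(int threshold) {
        this.threshold = threshold;
    }

    /** Call once per successfully fetched page. */
    public void recordFetch(String host) {
        pagesPerHost.merge(host, 1, Integer::sum);
    }

    /** True if this host has served at least `threshold` pages so far. */
    public boolean isLikelyCluster(String host) {
        return pagesPerHost.getOrDefault(host, 0) >= threshold;
    }
}
```

In practice the threshold would have to be calibrated against typical per-host counts in the crawl database, since "an order of magnitude more than regular servers" is relative to the crawl.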


A simple heuristic, but it could work. I don't know whether it's technically feasible in the current Nutch architecture, though. Another alternative could be counting the IP aliases of a domain, but this usually comes at a higher cost, since it requires an extra DNS lookup per host.
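The IP-alias alternative could be sketched like this (again hypothetical, not Nutch code): resolve the hostname and count the addresses it maps to, on the assumption that a load-balanced cluster is backed by several IPs.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/**
 * Hypothetical sketch: estimate cluster size by counting the IP
 * addresses a hostname resolves to. This is the "higher cost"
 * option -- it incurs a DNS lookup per host.
 */
public class IpAliasCounter {
    public static int countAddresses(String host) {
        try {
            return InetAddress.getAllByName(host).length;
        } catch (UnknownHostException e) {
            return 0; // unresolvable host: no information
        }
    }
}
```

Note that a low count does not rule out a cluster (round-robin DNS may expose one address at a time, or the balancing may happen behind a single VIP), so this would probably work best combined with the fetch-count heuristic rather than on its own.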

Dawid

