I've now tried this a bit more. If I set fetcher.server.delay to 1 second and use a machine with a faster connection, I can now get around
I discovered that things could still slow down if some hosts occur frequently in the fetchlist. So I added a new parameter, http.max.delays, which limits how many times a fetcher thread will wait before giving up on a url. On each attempt, if another thread is accessing the url's host, or has accessed it within the last fetcher.server.delay milliseconds, the thread sleeps for fetcher.server.delay. Once it has slept http.max.delays times without getting a turn, it gives up on the url.
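The check-sleep-retry logic above might be sketched roughly as follows. This is not the actual Fetcher.java code; the class and method names are made up, the host-claiming is simplified to a last-access timestamp, and only the two parameters named above are modeled:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PoliteFetchSketch {
    // Hypothetical constants standing in for the config parameters
    // fetcher.server.delay and http.max.delays.
    static final long SERVER_DELAY_MS = 1000L; // fetcher.server.delay
    static final int MAX_DELAYS = 3;           // http.max.delays

    // Last access time per host.  The real code also tracks whether
    // another thread is currently fetching from the host.
    static final Map<String, Long> lastAccess = new ConcurrentHashMap<>();

    /**
     * Try to claim a host for fetching.  Returns true if the caller may
     * fetch now; returns false (gives up on the url) after sleeping
     * MAX_DELAYS times without the host becoming free.
     */
    static boolean tryAcquire(String host) throws InterruptedException {
        for (int delays = 0; ; delays++) {
            synchronized (lastAccess) {
                long now = System.currentTimeMillis();
                if (now - lastAccess.getOrDefault(host, 0L) >= SERVER_DELAY_MS) {
                    lastAccess.put(host, now); // claim the host
                    return true;
                }
            }
            if (delays >= MAX_DELAYS) {
                return false; // exceeded http.max.delays; give up on this url
            }
            Thread.sleep(SERVER_DELAY_MS); // wait one server delay, then retry
        }
    }
}
```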
This gives behaviour much like RequestScheduler's, but is much simpler: RequestScheduler queues urls by host, then gives up on urls when the queues grow too long.
The general problem is that a polite fetcher (one that serializes accesses to each host, with a delay between accesses) takes a very long time to fetch all of the pages at, e.g., tripod.com or geocities.com. Fetching these takes much longer than fetching all of the other pages from other hosts politely, which slows down the whole fetch considerably.
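To put hypothetical numbers on it: with fetcher.server.delay at 1 second, a host with 100,000 pages takes at least

  100,000 pages x 1 second/page = 100,000 seconds (about 28 hours)

to fetch, no matter how many fetcher threads are running, since accesses to that host are serialized.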
Options I see are:
1. Drop some urls from common hosts (as Fetcher.java now does, and RequestScheduler long has).
2. Have a list of common hosts (geocities, tripod, etc.) which we do not treat politely.
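Option (2) could be as simple as a configurable set of hosts that are exempted from the per-host delay. A minimal sketch, with a made-up class name and an illustrative host list:

```java
import java.util.Set;

public class ImpoliteHosts {
    // Hypothetical hard-coded list; in practice this would come from
    // configuration.  Hosts listed here are fetched without the usual
    // per-host delay.
    static final Set<String> IMPOLITE = Set.of("geocities.com", "tripod.com");

    /** Returns true if accesses to this host should be serialized and delayed. */
    static boolean requiresDelay(String host) {
        return !IMPOLITE.contains(host);
    }
}
```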
Do folks think (2) is worth exploring?
Doug
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
