How about introducing these changes to force Nutch admins to edit the bot identity strings properly?

1. Add the http.agent.* entries to nutch-site.xml with the value "EDITME". The description should clearly state that these values *must* be edited to reflect the true identity of the site.
2. Add a piece of code to the HTTP crawler that checks the configuration. If any of the http.agent.* entries is still EDITME, the code logs an error and exits.
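
The check in item 2 could look roughly like the sketch below. It is only an illustration under some assumptions: the class name AgentConfigCheck and the method findUneditedKeys are made up, a plain Map stands in for whatever configuration object the crawler really uses, and the sample property values are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical startup check; not existing Nutch code. */
class AgentConfigCheck {
    /** Placeholder value the proposal suggests shipping in nutch-site.xml. */
    static final String PLACEHOLDER = "EDITME";
    /** Only properties under this prefix are checked. */
    static final String PREFIX = "http.agent.";

    /** Returns the http.agent.* keys whose value is still the placeholder. */
    static List<String> findUneditedKeys(Map<String, String> conf) {
        List<String> unedited = new ArrayList<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            String value = entry.getValue();
            if (entry.getKey().startsWith(PREFIX)
                    && value != null
                    && PLACEHOLDER.equals(value.trim())) {
                unedited.add(entry.getKey());
            }
        }
        return unedited;
    }

    public static void main(String[] args) {
        // Sample configuration (assumed values, for illustration only).
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("http.agent.name", "EDITME");
        conf.put("http.agent.url", "http://example.com/bot.html");
        conf.put("http.timeout", "10000");

        List<String> unedited = findUneditedKeys(conf);
        if (!unedited.isEmpty()) {
            // Log the problem and refuse to start, as item 2 proposes.
            System.err.println("Refusing to crawl; edit these properties in nutch-site.xml: "
                    + unedited);
            System.exit(1);
        }
    }
}
```

Reporting every unedited key at once, rather than stopping at the first one, would let an admin fix the whole configuration in a single pass.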
-kuro

P.S. I'm subscribed to the digest version of the mailing list. If the same or a better idea has already been raised, please ignore this.