How about introducing these changes in an effort to force Nutch admins
to properly edit the bot identity strings?
1. Add the http.agent.* entries to nutch-site.xml with the value set
    to "EDITME". The description should clearly state that these
    values *must* be edited to reflect the true identity of the site.
2. Add a piece of code to the HTTP crawler that checks the
    configuration. If any of the http.agent.* entries is still set to
    EDITME, the code would log the error and exit.
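
A rough sketch of the check in item 2 could look like the following. It
is only an illustration: the class and method names (AgentConfigCheck,
findUnedited, checkOrExit) are made up, a plain Map stands in for the
real configuration object, and System.err stands in for whatever logger
the crawler actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not existing Nutch code.
final class AgentConfigCheck {
    // Placeholder value proposed in item 1.
    static final String PLACEHOLDER = "EDITME";
    // Only properties under this prefix are checked.
    static final String PREFIX = "http.agent.";

    // Returns the names of all http.agent.* properties still set to EDITME.
    static List<String> findUnedited(Map<String, String> conf) {
        List<String> unedited = new ArrayList<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            String value = e.getValue();
            if (e.getKey().startsWith(PREFIX)
                    && value != null
                    && PLACEHOLDER.equals(value.trim())) {
                unedited.add(e.getKey());
            }
        }
        return unedited;
    }

    // Logs an error and exits if any identity string was left unedited.
    static void checkOrExit(Map<String, String> conf) {
        List<String> unedited = findUnedited(conf);
        if (!unedited.isEmpty()) {
            System.err.println("Refusing to crawl. Edit these properties in "
                    + "nutch-site.xml first: " + unedited);
            System.exit(1);
        }
    }
}
```

The crawler would call checkOrExit once at startup, before fetching
anything. Keeping findUnedited separate from the exit makes the check
easy to test and lets the error message list every offending property
at once.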

-kuro
P.S. I subscribe to the digest version of the mailing list. If the same
or a better idea has been raised already, please ignore this.

