Certainly, Nutch must follow robots.txt. Otherwise you risk getting your IP banned, or worse.
I find it quite illogical to refuse to change robots.txt because a crawler could declare a fake agent name while, on the other hand, letting a crawler that ignores robots.txt run over your site.

2009/9/11 Fuad Efendi <f...@efendi.ca>

> > My sysadm refuses to change the robots.txt citing the following reason:
> >
> > The moment he allows a specific agent, a lot of crawlers impersonate
> > as that user agent and tries to crawl that site.
>
> Extremely strange thoughts of some smart sys-minds...
>
> If crawler wants impersonate... it will, and it will ignore robots.txt, and
> sysadmin may ban such IP... I don't know any such public crawler except some
> desktop based download agents such as WebCEO or Teleport or even IE and
> Firefox...
>
> No way, Nutch must follow robots.txt.
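To make the point about agent names concrete, here is a minimal sketch of how a polite crawler evaluates per-agent rules. It uses Python's standard `urllib.robotparser`, not Nutch's own parser, and the agent name `examplebot`, the URL, and the robots.txt content are all made up for illustration. The check relies only on the name the crawler declares for itself, which is why allowing one agent is advisory and does nothing against a crawler that ignores robots.txt anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow one named agent, disallow everyone else.
ROBOTS_TXT = """\
User-agent: examplebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The named agent is allowed; any other declared name is not.
    print(is_allowed("examplebot", "http://example.com/page"))
    print(is_allowed("otherbot", "http://example.com/page"))
```

A crawler that obeys robots.txt runs a check like this before each fetch. One that does not obey it never consults the file, so the only defence left to the sysadmin is banning by IP, as Fuad says.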