Certainly, Nutch must follow robots.txt.

Otherwise you risk getting your IP banned, or worse.

I find it quite illogical to refuse to change robots.txt because an
agent can declare a fake agent name while, on the other hand, letting a
crawler that ignores robots.txt run over your site.
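
To make the point concrete, here is a minimal sketch using Python's standard
urllib.robotparser. The agent name "MyNutchBot", the rules, and the
example.com URL are all hypothetical, chosen only for illustration. It shows
that a per-agent rule is matched purely on the name the crawler declares, so
the rule only constrains crawlers that choose to consult robots.txt at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow one named agent, disallow everyone else.
ROBOTS_TXT = """\
User-agent: MyNutchBot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str) -> bool:
    """Return what a robots.txt-compliant crawler would conclude for `agent`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://example.com/page"  # hypothetical URL
    # The named agent is allowed; any other declared name is not.
    print("MyNutchBot:", allowed("MyNutchBot", url))
    print("OtherBot:  ", allowed("OtherBot", url))
```

An impostor that declares the same name gets the same answer as the real
crawler, and a crawler that ignores robots.txt never runs this check at all.
Either way, leaving robots.txt unchanged only shuts out the polite crawlers.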

2009/9/11 Fuad Efendi <f...@efendi.ca>

> >
> > My sysadmin refuses to change robots.txt, citing the following reason:
> >
> > The moment he allows a specific agent, a lot of crawlers impersonate
> > that user agent and try to crawl the site.
>
>
>
> Extremely strange thoughts of some smart sys-minds...
>
> If a crawler wants to impersonate... it will, and it will ignore robots.txt,
> and the sysadmin may ban such an IP... I don't know of any such public
> crawler except some desktop-based download agents such as WebCEO or
> Teleport, or even IE and Firefox...
>
> No way, Nutch must follow robots.txt.
>
>
>
