Does Nutch have any facility built in to remove pages from an index if the
robots.txt file on a previously crawled site changes to disallow Nutch?

For example, I crawl www.foo.com today, and Nutch is allowed.  Tomorrow the
foo.com administrator changes www.foo.com/robots.txt to disallow Nutch.  The
next time Nutch visits foo.com, it should see the new robots.txt and not
crawl anything new, but does it do anything to the index so that foo.com
pages are no longer returned in the results?
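To be concrete about the scenario: a minimal sketch of the robots.txt change I mean, using Python's standard urllib.robotparser rather than Nutch's own robots handling (the rules and URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# The hypothetical new www.foo.com/robots.txt that blocks Nutch.
rp = RobotFileParser()
rp.parse([
    "User-agent: Nutch",
    "Disallow: /",
])

# New fetches by Nutch are now disallowed...
print(rp.can_fetch("Nutch", "http://www.foo.com/page.html"))   # False
# ...while other crawlers are still allowed.
print(rp.can_fetch("OtherBot", "http://www.foo.com/page.html"))  # True
```

The question is what happens to the pages fetched *before* this rule appeared: a polite crawler will stop fetching, but the already-indexed documents are a separate matter.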

Thx
