Does Nutch have any built-in facility to remove pages from an index when the robots.txt file on a previously crawled site changes to disallow Nutch?
For example, I crawl www.foo.com today, and Nutch is allowed. Tomorrow the foo.com administrator changes www.foo.com/robots.txt to disallow Nutch. The next time Nutch goes over to foo.com, it should see the new robots.txt and not crawl anything new, but does it do anything to the index so that foo.com pages are not returned in the results? Thanks
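For concreteness, the changed robots.txt I have in mind would look something like this (assuming the administrator targets Nutch's user-agent name):

```
# www.foo.com/robots.txt after the change:
# block the Nutch crawler from the entire site
User-agent: Nutch
Disallow: /

# all other crawlers remain unrestricted
User-agent: *
Disallow:
```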
