Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD before
> the HTTP GET, and determine the mime-type before actually grabbing the
> content.
> It's not how Nutch works now, but this might be more useful than a
> super-detailed set of regexes...
This could be a useful addition, but it could not replace URL-based
filters. A HEAD request must still observe the same per-host politeness
delay as a GET, so this could substantially slow fetching: every URL
that passes the check would cost two delayed requests instead of one.
Also, for most dynamic pages, a HEAD is as expensive for the server as a
GET, since the page is typically generated either way, so this would
cause more load on servers.
Doug
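
To make the trade-off discussed above concrete, here is a minimal Python
sketch of a HEAD-before-GET check and of the extra politeness delay it
implies. It is not how Nutch is implemented (Nutch is written in Java);
the names ALLOWED_TYPES, head_then_get and politeness_wait, and the
5-second delay, are assumptions made purely for illustration.

```python
import time
import urllib.request

# Hypothetical allow-list for illustration; not taken from Nutch's config.
ALLOWED_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header):
    """Drop parameters such as '; charset=utf-8' and lower-case the type."""
    if not content_type_header:
        return None
    return content_type_header.split(";", 1)[0].strip().lower()


def wanted(content_type_header, allowed=ALLOWED_TYPES):
    """True if the Content-Type header names a type we want to fetch."""
    return media_type(content_type_header) in allowed


def head_then_get(url, delay_seconds=5.0, sleep=time.sleep):
    """Issue a HEAD, and GET the body only if the reported type is wanted.

    Returns the body bytes, or None if the type was rejected.
    urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    """
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=10) as resp:
        content_type = resp.headers.get("Content-Type")
    if not wanted(content_type):
        return None
    # The second request to the same host must wait out the delay again.
    sleep(delay_seconds)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def politeness_wait(urls, kept, delay_seconds, head_first):
    """Total per-host delay, counting one delay per request sent.

    With head_first, every URL costs a HEAD and each kept URL also
    costs a GET; without it, every URL costs a single GET.
    """
    requests = urls + kept if head_first else urls
    return requests * delay_seconds
```

For example, with 1000 URLs on one host, 900 of which turn out to be
wanted, and an assumed 5-second delay, politeness_wait(1000, 900, 5,
False) gives 5000 seconds of waiting for GET-only fetching, while
politeness_wait(1000, 900, 5, True) gives 9500 seconds. The HEAD pass
saves bandwidth on the 100 rejected URLs but nearly doubles the time
spent waiting, which is why it can complement URL-based filters but not
replace them.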