The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET and determine the MIME type from the Content-Type response header before actually fetching the content.

It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes...

[EMAIL PROTECTED]:~$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.localdomain.
Escape character is '^]'.
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Date: Thu, 01 Dec 2005 21:25:38 GMT
Server: Apache/2.0
Connection: close
Content-Type: text/html; charset=UTF-8

Connection closed by foreign host
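
For anyone who wants to try the same check from code rather than telnet, here is a minimal sketch in Python (not Nutch code, and not how Nutch does it today). The function names media_type and head_content_type are made up for illustration. It sends a HEAD request and returns only the media type part of Content-Type, assuming the server answers HEAD and reports the type honestly, which not every server does.

```python
import http.client
from urllib.parse import urlsplit


def media_type(header_value):
    """Strip parameters such as charset from a Content-Type header value."""
    if not header_value:
        return None
    return header_value.split(";", 1)[0].strip().lower() or None


def head_content_type(url, timeout=10.0):
    """Issue an HTTP HEAD for url and return its media type, or None.

    Hypothetical helper for illustration: no redirect handling and no
    fallback to GET when a server rejects HEAD.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        response.read()  # a HEAD response carries no body; this just finishes the exchange
        return media_type(response.getheader("Content-Type"))
    finally:
        conn.close()
```

A fetcher could call head_content_type(url) and skip the GET when the result is not a type it has a parser for. The cost is one extra round trip per URL.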



On Dec 1, 2005, at 4:21 PM, Doug Cutting wrote:

Chris Mattmann wrote:
  In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of whether
it ends in .foo, .bar or the like.

Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it.

Doug

--
Matt Kangas / [EMAIL PROTECTED]

