Re: Urlfilter Patch

Matt Kangas Thu, 01 Dec 2005 14:23:54 -0800

Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves them bandwidth if we don't do a GET. That's some cost savings for them over doing a blind fetch (esp. if we discard it).


I guess the question is, what's worse:
- two server hits when we find content we want, or
- spending bandwidth on pages that the Nutch installation will ignore anyway?
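
To make the tradeoff concrete, here is a rough sketch of the HEAD-then-GET idea in Python. This is not how Nutch works and it uses no Nutch code; the names WANTED_TYPES, wanted, and fetch_if_wanted are made up for illustration, and the allow-list is an assumption. A real fetcher would also have to honor its per-host politeness delay between the two requests.

```python
import urllib.request

# Hypothetical allow-list of MIME types worth fetching (an assumption,
# not taken from any Nutch configuration).
WANTED_TYPES = {"text/html", "text/plain"}


def wanted(content_type):
    """Return True if a Content-Type header value names a type we want."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in WANTED_TYPES


def fetch_if_wanted(url, timeout=10):
    """Issue a HEAD first, then a GET only if the reported type is wanted."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as resp:
        if not wanted(resp.headers.get("Content-Type")):
            return None  # one hit, no body transferred
    # Second server hit: this is the extra request for content we do want.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

Pages we want cost two requests; pages we would discard cost one request with no body.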


--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes...
This could be a useful addition, but it could not replace url-based filters. A HEAD request must still be polite, so this could substantially slow fetching, as it would incur more delays. Also, for most dynamic pages, a HEAD is as expensive for the server as a GET, so this would cause more load on servers.
Doug


--
Matt Kangas / [EMAIL PROTECTED]
