Doug,
After sleeping on this idea, I realized that there's a middle ground
that may give us (and website operators) the best of both worlds.
The question: how to avoid fetching unparseable content?
Value in answering this:
- save crawl operators bandwidth, disk space, cpu time
- save website operators bandwidth (and maybe cpu time) = be better
web citizens
Tools available:
- regex-urlfilter.txt (nearly free to run, but only an approximate
answer; example entries below)
- HTTP HEAD before GET (cheaper than blind GET, but mainly saves
bandwidth, not server cpu)
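For reference, regex-urlfilter.txt applies the first rule whose pattern
matches the URL: a leading '-' means skip, '+' means accept. A trimmed-down
illustration (these entries are just examples, not the full default list):

  # skip extensions we know we can't parse
  -\.(gif|jpg|png|zip|exe|mov|mpg)$
  # accept anything else
  +.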
Proposed strategy (rough sketch below):
1) Define regex-urlfilter.txt, as we do now. Continue to weed out
known-unparseable file extensions as early as possible.
2) Also define a second regex list for extensions that are very likely
to be text/html (e.g. .html, .php).
Fetch these blindly with HTTP GET.
3) For everything else, perform HTTP HEAD first. If the mime-type is
unparseable, do not follow with HTTP GET.
Advantages to this approach:
- still weeds out known-bad stuff as early as possible
- saves crawl+server bandwidth in questionable cases
- saves server load in high-confidence cases (eliminates HTTP HEAD)
Disadvantages: ?
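To make the three steps concrete, here's a rough sketch in plain Java.
The pattern lists, the class/method names, and the hard-coded "text/*"
test are placeholders of mine; a real implementation would live in the
URL-filter and protocol plugins and ask the configured parsers which
mime-types they can actually handle.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Pattern;

public class FetchDecision {

    // Step 1: extensions we never want (normally handled by regex-urlfilter.txt).
    static final Pattern EXCLUDE =
        Pattern.compile("\\.(gif|jpg|png|zip|exe|mov|mpg)$", Pattern.CASE_INSENSITIVE);

    // Step 2: extensions that are very likely text/html, safe to GET blindly.
    static final Pattern LIKELY_HTML =
        Pattern.compile("\\.(html?|php|asp|jsp)$", Pattern.CASE_INSENSITIVE);

    /** Returns true if we should issue a full GET for this URL. */
    static boolean shouldGet(String url) throws Exception {
        if (EXCLUDE.matcher(url).find()) {
            return false;                    // step 1: weed out early, no network cost
        }
        if (LIKELY_HTML.matcher(url).find()) {
            return true;                     // step 2: high confidence, skip the HEAD
        }
        // Step 3: questionable case -- ask the server for headers only.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        String type = conn.getContentType(); // e.g. "text/html; charset=UTF-8"
        conn.disconnect();
        return type != null && type.startsWith("text/");
    }
}

Note that the HEAD in step 3 only fires for URLs that fall through the
first two checks, so the extra (polite) round trip is limited to the
genuinely ambiguous cases.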
On Dec 1, 2005, at 5:23 PM, Matt Kangas wrote:
Totally agreed. Neither approach replaces the other. I just wanted
to mention the possibility so people don't over-focus on trying to
build a hyper-optimized regex list. :)
For the content provider, an HTTP HEAD request saves them bandwidth
if we don't do a GET. That's some cost savings for them over doing
a blind fetch (esp. if we discard it).
I guess the question is, what's worse:
- two server hits when we find content we want, or
- spending bandwidth on pages that the Nutch installation will
ignore anyway?
--matt
On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:
Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD
before the HTTP GET, and determine the mime-type before actually
grabbing the content.
It's not how Nutch works now, but this might be more useful than
a super-detailed set of regexes...
This could be a useful addition, but it could not replace url-
based filters. A HEAD request must still be polite, so this could
substantially slow fetching, as it would incur more delays. Also,
for most dynamic pages, a HEAD is as expensive for the server as a
GET, so this would cause more load on servers.
Doug
--
Matt Kangas / [EMAIL PROTECTED]
--
Matt Kangas / [EMAIL PROTECTED]