Doug,

After sleeping on this idea, I realized that there's a middle ground that may give us (and website operators) the best of both worlds.

The question: how to avoid fetching unparseable content?

Value in answering this:
- save crawl operators bandwidth, disk space, cpu time
- save website operators bandwidth (and maybe cpu time) = be better web citizens

Tools available:
- regex-urlfilter.txt (nearly free to run, but only an approximate answer)
- HTTP HEAD before GET (cheaper than a blind GET, but mainly saves bandwidth, not server cpu)
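For anyone who hasn't looked at the first tool: regex-urlfilter.txt is just a list of +/- rules where the first matching pattern wins. A stripped-down example (the extensions here are illustrative, not a recommendation):

# skip suffixes we know we can't parse; first matching rule wins
-\.(gif|jpg|png|exe|zip|gz|mp3)$
# let everything else through to the next stage
+.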

Proposed strategy:

1) Define regex-urlfilter.txt, as we do now. Continue to weed out known-unparseable file extensions as early as possible.
2) Also define another regex list for extensions that are very likely to be text/html (e.g. .html, .php). Fetch these blindly with HTTP GET.
3) For everything else, perform HTTP HEAD first. If the mime-type is unparseable, do not follow with HTTP GET. (Rough sketch of this decision logic below.)
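To make the three steps concrete, here's a rough sketch of the decision logic. This is plain java.net code, not actual Nutch protocol/fetcher code, and the regexes, class name, and mime-type check are all made up for illustration:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Pattern;

public class FetchDecision {

  // Step 1: known-unparseable extensions -- weed these out as early as possible.
  static final Pattern UNPARSEABLE =
      Pattern.compile("\\.(exe|zip|gz|iso|dmg)$", Pattern.CASE_INSENSITIVE);

  // Step 2: extensions very likely to be text/html -- fetch blindly with GET.
  static final Pattern LIKELY_HTML =
      Pattern.compile("\\.(html?|php|asp|jsp)$", Pattern.CASE_INSENSITIVE);

  // Returns true if we should go ahead with an HTTP GET on this URL.
  static boolean shouldGet(String url) throws Exception {
    if (UNPARSEABLE.matcher(url).find()) {
      return false;                 // step 1: skip entirely, no request at all
    }
    if (LIKELY_HTML.matcher(url).find()) {
      return true;                  // step 2: high confidence, blind GET, no HEAD
    }
    // Step 3: questionable case -- issue a (polite) HEAD and check the mime-type.
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("HEAD");
    String type = conn.getContentType();   // e.g. "text/html; charset=UTF-8"
    conn.disconnect();
    // Real code would consult whichever parsers are configured; this is a stand-in.
    return type != null
        && (type.startsWith("text/html") || type.startsWith("text/plain"));
  }
}

(In the real fetcher the HEAD in step 3 would of course go through the same politeness/delay machinery as any other request.)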

Advantages to this approach:
- still weeds out known-bad stuff as early as possible
- saves crawl+server bandwidth in questionable cases
- saves server load in high-confidence cases (eliminates HTTP HEAD)

Disadvantages: ?


On Dec 1, 2005, at 5:23 PM, Matt Kangas wrote:

Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves them bandwidth if we don't do a GET. That's some cost savings for them over doing a blind fetch (esp. if we discard it).

I guess the question is, what's worse:
- two server hits when we find content we want, or
- spending bandwidth on pages that the Nutch installation will ignore anyway?

--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based filters. A HEAD request must still be polite, so this could substantially slow fetching, as it would incur more delays. Also, for most dynamic pages, a HEAD is as expensive for the server as a GET, so this would cause more load on servers.

Doug

--
Matt Kangas / [EMAIL PROTECTED]



--
Matt Kangas / [EMAIL PROTECTED]

