Doug,
After sleeping on this idea, I realized that there's a middle ground
that may give us (and website operators) the best of both worlds.
The question: how to avoid fetching unparseable content?
Value in answering this:
- save crawl operators bandwidth, disk space, cpu time
- save website operators bandwidth (and maybe cpu time) = be better
web citizens
Tools available:
- regex-urlfilter.txt (nearly free to run, but only an approximate
answer; example entries below)
- HTTP HEAD before GET (cheaper than blind GET, but mainly saves
bandwidth, not server cpu)
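For reference, regex-urlfilter.txt applies the first rule whose pattern
matches the URL: a leading '-' means skip, '+' means accept. A trimmed-down
illustration (these entries are just examples, not the full default list):

  # skip extensions we know we can't parse
  -\.(gif|jpg|png|zip|exe|mov|mpg)$
  # accept anything else
  +.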
Proposed strategy (rough sketch below):
1) Define regex-urlfilter.txt, as we do now. Continue to weed out
known-unparseable file extensions as early as possible.
2) Also define a second regex list for extensions that are very likely
to be text/html (e.g. .html, .php).
Fetch these blindly with HTTP GET.
3) For everything else, perform HTTP HEAD first. If the mime-type is
unparseable, do not follow with HTTP GET.
Advantages to this approach:
- still weeds out known-bad stuff as early as possible
- saves crawl+server bandwidth in questionable cases
- saves server load in high-confidence cases (eliminates HTTP HEAD)
Disadvantages: ?
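To make the three steps concrete, here's a rough sketch in plain Java.
The pattern lists, the class/method names, and the hard-coded "text/*"
test are placeholders of mine; a real implementation would live in the
URL-filter and protocol plugins and ask the configured parsers which
mime-types they can actually handle.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Pattern;

public class FetchDecision {

    // Step 1: extensions we never want (normally handled by regex-urlfilter.txt).
    static final Pattern EXCLUDE =
        Pattern.compile("\\.(gif|jpg|png|zip|exe|mov|mpg)$", Pattern.CASE_INSENSITIVE);

    // Step 2: extensions that are very likely text/html, safe to GET blindly.
    static final Pattern LIKELY_HTML =
        Pattern.compile("\\.(html?|php|asp|jsp)$", Pattern.CASE_INSENSITIVE);

    /** Returns true if we should issue a full GET for this URL. */
    static boolean shouldGet(String url) throws Exception {
        if (EXCLUDE.matcher(url).find()) {
            return false;                    // step 1: weed out early, no network cost
        }
        if (LIKELY_HTML.matcher(url).find()) {
            return true;                     // step 2: high confidence, skip the HEAD
        }
        // Step 3: questionable case -- ask the server for headers only.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        String type = conn.getContentType(); // e.g. "text/html; charset=UTF-8"
        conn.disconnect();
        return type != null && type.startsWith("text/");
    }
}

Note that the HEAD in step 3 only fires for URLs that fall through the
first two checks, so the extra (polite) round trip is limited to the
genuinely ambiguous cases.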
On Dec 1, 2005, at 5:23 PM, Matt Kangas wrote:
Totally agreed. Neither approach replaces the other. I just wanted
to mention the possibility so people don't over-focus on trying to
build a hyper-optimized regex list. :)
For the content provider, an HTTP HEAD request saves them bandwidth
if we don't do a GET. That's some cost savings for them over doing
a blind fetch (esp. if we discard it).
I guess the question is, what's worse:
- two server hits when we find content we want, or
- spending bandwidth on pages that the Nutch installation will
ignore anyway?
--matt
On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:
Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD
before the HTTP GET, and determine the mime-type before actually
grabbing the content.
It's not how Nutch works now, but this might be more useful than
a super-detailed set of regexes...
This could be a useful addition, but it could not replace url-
based filters. A HEAD request must still be polite, so this could
substantially slow fetching, as it would incur more delays. Also,
for most dynamic pages, a HEAD is as expensive for the server as a
GET, so this would cause more load on servers.
Doug
--
Matt Kangas / [EMAIL PROTECTED]
--
Matt Kangas / [EMAIL PROTECTED]