[EMAIL PROTECTED] wrote:
> Hello,
> I'm using Nutch's Fetcher for my Simpy.com project, and one of the
> things I'd like to do is detect broken links (any type of error - wrong
> host name, 404, 500, 302, etc.). From what I can tell, only successful
> fetches (200s and maybe 301/302s that result in 200) end up being
> written to disk, while all other links don't get stored anywhere (ASF
> SVN is down right now, can't double-check this).
Not anymore - the current code in SVN records everything, even failed
fetches, precisely for the reasons you mentioned. If there is any
content associated with a failed fetch, then this content is written
down as well.
However, the failure code is translated from protocol-specific codes to
generic codes.
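
To illustrate what such a translation looks like, here is a minimal sketch. It is not the actual Nutch code: the class name, the GenericStatus values and the particular mapping are all assumptions made up for this example. The point it shows is that once a protocol code has been folded into a generic category, the original number (404 vs. 410, say) can only be recovered if it is stored separately.

```java
/** Hypothetical sketch; none of these names are taken from the Nutch code base. */
public class StatusTranslation {

    /** Assumed generic outcome categories; the real set in Nutch may differ. */
    enum GenericStatus { SUCCESS, MOVED, NOT_FOUND, GONE, RETRY, FAILED }

    /** Fold an HTTP status code into a protocol-independent category. */
    static GenericStatus fromHttp(int httpCode) {
        if (httpCode >= 200 && httpCode < 300) return GenericStatus.SUCCESS;
        if (httpCode == 301 || httpCode == 302) return GenericStatus.MOVED;
        if (httpCode == 404) return GenericStatus.NOT_FOUND;
        if (httpCode == 410) return GenericStatus.GONE;
        if (httpCode >= 500 && httpCode < 600) return GenericStatus.RETRY;
        // Everything else is lumped together in this sketch.
        return GenericStatus.FAILED;
    }

    public static void main(String[] args) {
        int[] codes = {200, 302, 404, 500, 403};
        for (int code : codes) {
            System.out.println(code + " -> " + fromHttp(code));
        }
    }
}
```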
> What's the best place to plug in some code to grab the broken links as
> their fetches are failing? I looked at Fetcher.java this morning and
> saw handleFetch and handleNoFetch methods. Is this the best place to
> add code for my purposes? I'm not too familiar with Nutch's plugin
> system, but can I write a plugin that plugs into those 2 methods?
In the latest SVN these methods are handleFetch and logError. As Stefan
explained, you would have to modify these methods to invoke plugins
under a new extension point. This is not complicated at all. The general
contract is usually that plugins act as chained filters, which process
their arguments in order. Each extension point follows an implicit
"mini-contract" about what the processing result means; often a null
result means "discard and proceed to the next entry".
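
A rough sketch of that chained-filter contract follows. It assumes a made-up extension point: FetchFailureFilter, FailedFetch and runChain are hypothetical names for illustration, not existing Nutch interfaces, and the real plugin loading machinery is left out entirely. Each filter receives the result of the previous one, and a null result discards the entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a chained-filter extension point; not Nutch's API. */
public class FetchFailureFilters {

    /** Assumed minimal record describing a failed fetch. */
    static final class FailedFetch {
        final String url;
        final String reason;

        FailedFetch(String url, String reason) {
            this.url = url;
            this.reason = reason;
        }
    }

    /** Assumed extension point: return the record to pass it on, or null to discard it. */
    interface FetchFailureFilter {
        FailedFetch filter(FailedFetch record);
    }

    /** Run the filters in order; stop as soon as one of them discards the record. */
    static FailedFetch runChain(List<FetchFailureFilter> filters, FailedFetch record) {
        FailedFetch current = record;
        for (FetchFailureFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // discarded: the caller proceeds to the next entry
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> brokenLinks = new ArrayList<>();
        List<FetchFailureFilter> filters = new ArrayList<>();

        // First filter: discard records that carry no URL.
        filters.add(r -> (r.url == null || r.url.isEmpty()) ? null : r);
        // Second filter: remember the broken link, then pass the record on unchanged.
        filters.add(r -> {
            brokenLinks.add(r.url + " (" + r.reason + ")");
            return r;
        });

        runChain(filters, new FailedFetch("http://example.com/missing", "not found"));
        runChain(filters, new FailedFetch("", "unknown host"));

        System.out.println(brokenLinks);
    }
}
```

In this sketch a call to something like runChain would be placed inside handleFetch and logError, which is the modification described above; where exactly it belongs in those methods is not something this example settles.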
> Or is there a way to give Nutch a URL and get its HTTP status response
> code back after fetching, merging, indexing, and optimizing is done?
The specific protocol-dependent response codes are not recorded.
However, translated error codes are recorded in segment data, and a
subset of these translated codes is recorded in WebDB.
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com