[EMAIL PROTECTED] wrote:
Hello,

I'm using Nutch's Fetcher for my Simpy.com project, and one of the
things I'd like to do is detect broken links (any type of error - wrong
host name, 404, 500, 302, etc.).  From what I can tell, only successful
fetches (200s and maybe 301/302s that result in 200) end up being
written to disk, while all other links don't get stored anywhere (ASF
SVN is down right now, can't double-check this).


Not anymore. The current code in SVN records everything, including failed fetches, for precisely the reasons you mention. If a failed fetch has any content associated with it, that content is stored as well.

However, the failure code is translated from protocol-specific codes to generic codes.

What's the best place to plug in some code to grab the broken links as
their fetches are failing?  I looked at Fetcher.java this morning and
saw handleFetch and handleNoFetch methods.  Is this the best place to
add code for my purposes?  I'm not too familiar with Nutch's plugin
system, but can I write a plugin that plugs into those 2 methods?

In the latest SVN these methods are handleFetch and logError. As Stefan explained, you would have to modify these methods to invoke plugins under a new extension point. This is not complicated: the general contract is usually that plugins act as chained filters, each processing its arguments in turn. Each extension point follows an implicit "mini-contract" about what the processing result means; often a null result means "discard and proceed to the next entry".
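
To illustrate the chained-filter contract, here is a minimal, self-contained sketch. Everything in it is hypothetical: FailedFetch, FetchFailureFilter and runChain are made-up names for illustration and are not part of Nutch, and the real extension point would have to be defined in the plugin descriptors and called from handleFetch / logError.

```java
import java.util.ArrayList;
import java.util.List;

public class FetchFailureChain {

    /** Hypothetical record of a failed fetch; not a Nutch class. */
    static final class FailedFetch {
        final String url;
        final String genericCode; // translated code, e.g. "NOT_FOUND"

        FailedFetch(String url, String genericCode) {
            this.url = url;
            this.genericCode = genericCode;
        }
    }

    /** Hypothetical extension-point interface for failed fetches. */
    interface FetchFailureFilter {
        /** Returns the (possibly modified) record, or null to discard it. */
        FailedFetch filter(FailedFetch failure);
    }

    /** Runs the filters in order; a null result stops the chain. */
    static FailedFetch runChain(List<FetchFailureFilter> filters, FailedFetch failure) {
        FailedFetch current = failure;
        for (FetchFailureFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // discarded: the caller proceeds to the next entry
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> brokenLinks = new ArrayList<>();
        List<FetchFailureFilter> chain = new ArrayList<>();
        // First filter: only keep HTTP(S) URLs, discard everything else.
        chain.add(f -> f.url.startsWith("http") ? f : null);
        // Second filter: record the broken link and pass it on unchanged.
        chain.add(f -> { brokenLinks.add(f.url + " " + f.genericCode); return f; });

        runChain(chain, new FailedFetch("http://example.com/missing", "NOT_FOUND"));
        runChain(chain, new FailedFetch("ftp://example.com/file", "EXCEPTION"));
        System.out.println(brokenLinks);
    }
}
```

In this sketch the second filter never sees the ftp entry, because the first filter returned null for it.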


Or is there a way to give Nutch a URL and get its HTTP status response
code back after fetching, merging, indexing, and optimizing is done?

The specific protocol-dependent response codes are not recorded. However, translated error codes are recorded in segment data, and a subset of these translated codes is recorded in WebDB.
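
As a rough illustration of why the original HTTP code cannot be recovered afterwards, here is a sketch of such a translation. The GenericStatus names and the particular mapping are assumptions made up for this example; the actual generic codes and mapping used in Nutch may differ.

```java
public class StatusTranslation {

    /** Hypothetical generic codes; the real set in Nutch may differ. */
    enum GenericStatus { SUCCESS, MOVED, NOT_FOUND, ACCESS_DENIED, RETRY, FAILED }

    /** Maps an HTTP status code to a generic code (assumed mapping). */
    static GenericStatus fromHttp(int httpCode) {
        if (httpCode >= 200 && httpCode < 300) {
            return GenericStatus.SUCCESS;
        }
        if (httpCode >= 300 && httpCode < 400) {
            return GenericStatus.MOVED;
        }
        switch (httpCode) {
            case 401:
            case 403:
                return GenericStatus.ACCESS_DENIED;
            case 404:
            case 410:
                return GenericStatus.NOT_FOUND;
            case 503:
                return GenericStatus.RETRY;
            default:
                return GenericStatus.FAILED;
        }
    }

    public static void main(String[] args) {
        int[] codes = {200, 302, 404, 410, 500, 503};
        for (int code : codes) {
            System.out.println(code + " -> " + fromHttp(code));
        }
    }
}
```

Under this assumed mapping, 404 and 410 both end up as NOT_FOUND, so once only the generic code is stored there is no way to tell which of the two the server returned.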

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
