Hi,

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > Hello,
> > 
> > I'm using Nutch's Fetcher for my Simpy.com project, and one of the
> > things I'd like to do is detect broken links (any type of error -
> > wrong host name, 404, 500, 302, etc.).  From what I can tell, only
> > successful fetches (200s and maybe 301/302s that result in 200) end
> > up being written to disk, while all other links don't get stored
> > anywhere (ASF SVN is down right now, can't double-check this).
> > 
> 
> Not anymore - the current code in SVN records everything, even failed
> fetches, precisely for the reasons you mentioned. If there is any
> content associated with a failed fetch, then this content is written
> down as well.
> 
> However, the failure code is translated from protocol-specific codes
> to generic codes.

Aha, yes, that is what I saw in the repo this morning.  Good!

> > What's the best place to plug in some code to grab the broken links
> > as their fetches are failing?  I looked at Fetcher.java this morning
> > and saw handleFetch and handleNoFetch methods.  Is this the best
> > place to add code for my purposes?  I'm not too familiar with Nutch's
> > plugin system, but can I write a plugin that plugs into those 2
> > methods?
> 
> In the latest SVN these methods are handleFetch and logError. As
> Stefan explained, you would have to modify these methods to invoke
> plugins under a new extension point. This is not complicated at all;
> the general contract usually is that plugins act as chained filters,
> which process their arguments in order. Each extension point follows
> an implicit "mini-contract" about what the processing result means -
> often a null result means "discard and proceed to next entry".
> 
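The chained-filter contract described above could look roughly like the sketch below. This is only an illustration: FetchResult, FetchResultFilter and runChain are hypothetical names made up for this example, not Nutch classes or an existing extension point. Each filter receives the result of the previous one, and a null result means "discard and proceed to next entry".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FetchFilterChain {

    /** Hypothetical record of one fetch attempt; not a Nutch class. */
    static final class FetchResult {
        final String url;
        final boolean success;

        FetchResult(String url, boolean success) {
            this.url = url;
            this.success = success;
        }
    }

    /** Hypothetical extension-point interface; returning null means "discard". */
    interface FetchResultFilter {
        FetchResult filter(FetchResult result);
    }

    /** Passes the entry through each filter in order and stops at the first null. */
    static FetchResult runChain(List<FetchResultFilter> filters, FetchResult entry) {
        FetchResult current = entry;
        for (FetchResultFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // discarded; the caller proceeds to the next entry
            }
        }
        return current;
    }

    /** Runs two sample entries through a two-filter chain and returns the broken URLs seen. */
    static List<String> demo() {
        final List<String> broken = new ArrayList<>();
        List<FetchResultFilter> filters = new ArrayList<>();
        // First filter: remember failed URLs, but pass every entry on unchanged.
        filters.add(r -> {
            if (!r.success) {
                broken.add(r.url);
            }
            return r;
        });
        // Second filter: discard failed entries by returning null.
        filters.add(r -> r.success ? r : null);

        List<FetchResult> entries = Arrays.asList(
            new FetchResult("http://example.com/ok", true),
            new FetchResult("http://example.com/missing", false));
        for (FetchResult e : entries) {
            FetchResult kept = runChain(filters, e);
            System.out.println(e.url + " -> " + (kept == null ? "discarded" : "kept"));
        }
        return broken;
    }

    public static void main(String[] args) {
        System.out.println("broken: " + demo());
    }
}
```

The order of the filters matters here: the recording filter runs before the discarding one, so broken links are captured before they are dropped from the chain.
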
> > 
> > Or is there a way to give Nutch a URL and get its HTTP status
> > response code back after fetching, merging, indexing, and optimizing
> > is done?
> 
> The specific protocol-dependent response codes are not recorded. 
> However, translated error codes are recorded in segment data, and a 
> subset of these translated codes is recorded in WebDB.

It looks like I may not even need to write a plugin, as these
translated error codes may be sufficient for now.
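
To make the idea of "translated" codes concrete, here is a small sketch of mapping protocol-specific HTTP status codes onto a few generic outcomes. The GenericStatus names and the mapping rules are assumptions invented for illustration; the actual generic codes Nutch uses are not shown in this thread.

```java
public class StatusTranslation {

    /** Hypothetical generic outcomes; these are NOT Nutch's real status codes. */
    enum GenericStatus { SUCCESS, MOVED, NOT_FOUND, RETRY_LATER, FAILED }

    /** Collapses an HTTP status code into one of the hypothetical generic outcomes. */
    static GenericStatus fromHttp(int httpCode) {
        if (httpCode >= 200 && httpCode < 300) {
            return GenericStatus.SUCCESS;
        }
        if (httpCode >= 300 && httpCode < 400) {
            return GenericStatus.MOVED;
        }
        if (httpCode == 404 || httpCode == 410) {
            return GenericStatus.NOT_FOUND;
        }
        if (httpCode >= 500 && httpCode < 600) {
            return GenericStatus.RETRY_LATER;
        }
        return GenericStatus.FAILED; // anything else, e.g. 403
    }

    public static void main(String[] args) {
        int[] samples = {200, 302, 404, 500, 403};
        for (int code : samples) {
            System.out.println(code + " -> " + fromHttp(code));
        }
    }
}
```

With a mapping like this, a 404 and a 410 become indistinguishable after translation, which is the kind of detail that is lost when only generic codes are recorded.
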

I tried Nutch from SVN, but I'm getting this error:

Exception in thread "fetcher3" java.lang.NoSuchMethodError: org.apache.nutch.parse.ParseData.<init>(Ljava/lang/String;[Lorg/apache/nutch/parse/Outlink;Ljava/util/Properties;)V
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:214)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:254)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:204)


Does this look familiar to anyone?
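
For reference, the descriptor in the trace names a ParseData constructor taking (String, Outlink[], Properties). A NoSuchMethodError at runtime generally means the calling class was compiled against a different version of a class than the one actually loaded, for example a stale jar or class file on the classpath. The sketch below is a generic diagnostic, not part of Nutch: ClassOrigin and its methods are hypothetical helper names. It prints where a class was loaded from and which constructors it really has.

```java
import java.lang.reflect.Constructor;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

public class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, if the JVM reports one. */
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
    }

    /** Lists the constructors the loaded class declares, as readable signatures. */
    static List<String> constructorSignatures(Class<?> c) {
        List<String> out = new ArrayList<>();
        for (Constructor<?> ctor : c.getDeclaredConstructors()) {
            out.add(ctor.toString());
        }
        return out;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass the class to inspect as the first argument; defaults to a JDK class.
        String name = args.length > 0 ? args[0] : "java.util.Properties";
        Class<?> c = Class.forName(name);
        System.out.println("loaded from: " + locationOf(c));
        for (String sig : constructorSignatures(c)) {
            System.out.println("  " + sig);
        }
    }
}
```

Run with the same classpath as the fetcher and with org.apache.nutch.parse.ParseData as the argument, this would show whether the loaded class has the three-argument constructor the HTML parser expects, and which location it came from.
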

Thanks,
Otis
