Hi,
--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > Hello,
> >
> > I'm using Nutch's Fetcher for my Simpy.com project, and one of the
> > things I'd like to do is detect broken links (any type of error -
> > wrong host name, 404, 500, 302, etc.). From what I can tell, only
> > successful fetches (200s and maybe 301/302s that result in 200) end
> > up being written to disk, while all other links don't get stored
> > anywhere (ASF SVN is down right now, can't double-check this).
> >
>
> Not anymore - the current code in SVN records everything, even failed
> fetches, precisely for the reasons you mentioned. If there is any
> content associated with a failed fetch, then this content is written
> down as well.
>
> However, the failure code is translated from protocol-specific codes
> to generic codes.
Aha, yes, that is what I saw in the repo this morning. Good!
> > What's the best place to plug in some code to grab the broken links
> > as their fetches are failing? I looked at Fetcher.java this morning
> > and saw handleFetch and handleNoFetch methods. Is this the best
> > place to add code for my purposes? I'm not too familiar with Nutch's
> > plugin system, but can I write a plugin that plugs into those 2
> > methods?
>
> In the latest SVN these methods are handleFetch and logError. As
> Stefan explained, you would have to modify these methods to invoke
> plugins under a new extension point. This is not complicated at all;
> the general contract usually is that plugins act as chained filters,
> which process their arguments in order. Each extension point follows
> an implicit "mini-contract" about what the processing result means -
> often a null result means "discard and proceed to next entry".
>
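Just to check that I read the chained-filter contract correctly, here is a minimal Java sketch of it. Every name in it (FetchResult, FetchResultFilter, BrokenLinkCollector, DropFailures, runChain) is made up for illustration and is not part of Nutch's actual plugin API; the only thing taken from your description is "filters run in order, and a null result means discard".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types exist in Nutch.
public class FetchFilterChain {

    /** Minimal stand-in for whatever the fetcher would hand to plugins. */
    public static final class FetchResult {
        public final String url;
        public final boolean success;

        public FetchResult(String url, boolean success) {
            this.url = url;
            this.success = success;
        }
    }

    /** Assumed extension-point contract: return null to discard the entry. */
    public interface FetchResultFilter {
        FetchResult filter(FetchResult result);
    }

    /** Records failed URLs and passes every result through unchanged. */
    public static final class BrokenLinkCollector implements FetchResultFilter {
        public final List<String> broken = new ArrayList<String>();

        public FetchResult filter(FetchResult result) {
            if (!result.success) {
                broken.add(result.url);
            }
            return result;
        }
    }

    /** Discards failed results so that later filters never see them. */
    public static final class DropFailures implements FetchResultFilter {
        public FetchResult filter(FetchResult result) {
            return result.success ? result : null;
        }
    }

    /** Runs the filters in order and stops at the first null result. */
    public static FetchResult runChain(List<FetchResultFilter> filters,
                                       FetchResult input) {
        FetchResult current = input;
        for (FetchResultFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // discarded; caller proceeds to the next entry
            }
        }
        return current;
    }
}
```

In this sketch the order of the chain matters: the collector has to come before any filter that drops failures, otherwise it would never see the broken links.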
> >
> > Or is there a way to give Nutch a URL and get its HTTP status
> > response code back after fetching, merging, indexing, and
> > optimizing is done?
>
> The specific protocol-dependent response codes are not recorded.
> However, translated error codes are recorded in segment data, and a
> subset of these translated codes is recorded in WebDB.
It looks like I may not even need to write a plugin, as these
translated error codes may be sufficient for now.

I tried Nutch from SVN, but I'm getting this error:
Exception in thread "fetcher3" java.lang.NoSuchMethodError: org.apache.nutch.parse.ParseData.<init>(Ljava/lang/String;[Lorg/apache/nutch/parse/Outlink;Ljava/util/Properties;)V
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:214)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:254)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:204)
Does this look familiar to anyone?
Thanks,
Otis