On Wed, Apr 1, 2009 at 13:29, George Herlin <ghher...@gmail.com> wrote:
> Sorry, forgot to say, there is an added precondition to causing the bug:
>
> The redirection has to be fetched before the page it redirects to... if
> not, there will be a pre-existing crawl datum with a reasonable
> refetch-interval.

Maybe this is something fixed between 0.9 and 1.0, but I think
CrawlDbReducer fixes these datums, around line 147 (case
CrawlDatum.STATUS_LINKED). Have you actually got stuck in an infinite
loop because of it?

> 2009/4/1 George Herlin <ghher...@gmail.com>
>
>> Hello, there.
>>
>> I believe I may have found an infinite loop in Nutch 0.9.
>>
>> It happens when a site has a page that refers to itself through a
>> redirection.
>>
>> The code in Fetcher.run(), around line 200 - sorry, my Fetcher has been
>> a little modified, line numbers may vary a little - says, for that case:
>>
>> output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);
>>
>> What that does is insert an extra (empty) crawl datum for the new url,
>> with a re-fetch interval of 0.0.
>>
>> However (see Generator.Selector.map(), particularly lines 144-145), the
>> non-refetch condition used seems to be last-fetch+refetch-interval>now
>> ... which is always false if refetch-interval==0.0!
>>
>> Now, if there is a new link to the new url in that page, that crawl
>> datum is re-used, and the whole thing loops indefinitely.
>>
>> I've fixed that for myself by changing the quoted line (twice) to:
>>
>> output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
>> CrawlDatum.STATUS_LINKED);
>>
>> and that works (btw, the 30f should really be the value of
>> "db.default.fetch.interval", but I haven't the time right now to work
>> out the issues). In reality, the default constructor and the appropriate
>> updater method should, if I am right in analysing the algorithm, always
>> enforce a positive refetch interval.
>>
>> Of course, another method could be used to remove this self-reference,
>> but that could be complicated, as the self-reference may happen through
>> a loop (2 or more pages, etc. ... you know what I mean).
>>
>> Has that been fixed already, and by what method?
>>
>> Best regards
>>
>> George Herlin

--
Doğacan Güney
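
For anyone following along, here is a minimal Java sketch of the
generate-time check George is describing. It is paraphrased from his
description, not copied from the Nutch 0.9 source; the datum accessors
mirror CrawlDatum's getFetchTime()/getFetchInterval(), but the
surrounding variable names are illustrative:

    // Paraphrase of the "non-refetch" test in Generator.Selector.map()
    // (illustrative, not verbatim Nutch source). In Nutch 0.9 the
    // fetch interval is a float number of days.
    long curTime = System.currentTimeMillis();
    long dayMs = 24L * 60 * 60 * 1000;

    // Skip the URL if last fetch + refetch interval is still in the future.
    if (datum.getFetchTime() + (long) (datum.getFetchInterval() * dayMs) > curTime) {
      return; // not due for refetch yet
    }

    // With the empty CrawlDatum() written by Fetcher.run(),
    // getFetchInterval() is 0.0f, so this skip condition is in effect
    // always false, and the self-redirecting URL is selected again on
    // every generate cycle.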
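And here is a sketch of the fix with the interval pulled from
configuration rather than hardcoded, which is what George's parenthetical
about "db.default.fetch.interval" suggests. It assumes a Hadoop
Configuration object (conf) is in reach inside the fetcher; getFloat() is
the standard org.apache.hadoop.conf.Configuration accessor:

    // Configurable variant of George's fix (sketch). 30f matches the
    // shipped default of db.default.fetch.interval, which in Nutch 0.9
    // is expressed in days.
    float defaultInterval = conf.getFloat("db.default.fetch.interval", 30f);
    output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, defaultInterval),
           null, null, CrawlDatum.STATUS_LINKED);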