George,

Try using Nutch-1.0 instead. I have tested your example with the SVN version and it did not run into the problem you described.
J.

2009/4/2 George Herlin <ghher...@gmail.com>

> Indeed I have... that's how I found out.
>
> My test case: crawl
>
> http://www.purdue.ca/research/research_clinical.asp
>
> with crawl-urlfilter and regex-urlfilter ending with
>
> #purdue
> +^http://www.purdue.ca/research/
> +^http://www.purdue.ca/pdf/
>
> # reject anything else
> -.
>
> The site is very small (which helped in diagnosis).
>
> Attached is the beginning of a run log, just in case.
>
> brgds
>
> George
>
> ================LOG====================
> Resource not found: commons-logging.properties
> Resource not found: META-INF/services/org.apache.commons.logging.LogFactory
> Resource not found: log4j.xml
> Resource found: log4j.properties
> Resource found: hadoop-default.xml
> Resource found: hadoop-site.xml
> Resource found: nutch-default.xml
> Resource found: nutch-site.xml
> Resource not found: crawl-tool.xml
> Injector: starting
> Injector: crawlDb: crawl-www.purdue.ca-20090402110952/crawldb
> Injector: urlDir: conf/purdueHttp
> Injector: Converting injected urls to crawl db entries.
> Resource not found: META-INF/services/javax.xml.transform.TransformerFactory
> Resource not found: META-INF/services/com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager
> Resource not found: com/sun/org/apache/xml/internal/serializer/XMLEntities_en.properties
> Resource not found: com/sun/org/apache/xml/internal/serializer/XMLEntities_en_US.properties
> Resource found: regex-normalize.xml
> Resource found: regex-urlfilter.txt
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402110955
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402110955
> Fetcher: threads: 1
> Resource found: parse-plugins.xml
> fetching http://www.purdue.ca/research/research_clinical.asp
> Resource found: mime-types.xml
> Resource not found: META-INF/services/org.apache.xerces.impl.Version
> Resource found: www.purdue.ca.html.parser-conf.properties
> Resource found: www.purdue.ca.resultslist.html.parser-conf.properties
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
> CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402110955]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111003
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111003
> Fetcher: threads: 1
> fetching http://www.purdue.ca/research/
> fetching http://www.purdue.ca/research/research_ongoing.asp
> fetching http://www.purdue.ca/research/research_quality.asp
> fetching http://www.purdue.ca/research/research_completed.asp
> fetching http://www.purdue.ca/research/research_contin.asp
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
> CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402111003]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111024
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111024
> Fetcher: threads: 1
> fetching http://www.purdue.ca/research/research.asp
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
> CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402111024]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111031
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111031
> Fetcher: threads: 1
> fetching http://www.purdue.ca/research/research.asp
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
> CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402111031]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111038
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl-www.purdue.ca-20090402110952/segments/20090402111038
> Fetcher: threads: 1
> fetching http://www.purdue.ca/research/research.asp
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl-www.purdue.ca-20090402110952/crawldb
> CrawlDb update: segments: [crawl-www.purdue.ca-20090402110952/segments/20090402111038]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> ...
> (it just goes on like that, indefinitely)
> =======================================
>
>
> George Herlin wrote:
>
>> On Wed, Apr 1, 2009 at 13:29, George Herlin <ghher...@gmail.com> wrote:
>>
>>> Sorry, forgot to say, there is an added precondition to causing the bug:
>>>
>>> The redirection has to be fetched before the page it redirects to... if
>>> not, there will be a pre-existing crawl datum with a reasonable
>>> refetch-interval.
>>
>> Maybe this is something fixed between 0.9 and 1.0, but I think
>> CrawlDbReducer fixes these datums, around line 147 (case
>> CrawlDatum.STATUS_LINKED). Have you ever got stuck in an infinite loop
>> because of it?
>>
>>> 2009/4/1 George Herlin <ghher...@gmail.com>
>>>
>>> Hello, there.
>>>
>>> I believe I may have found an infinite loop in Nutch 0.9.
>>>
>>> It happens when a site has a page that refers to itself through a
>>> redirection.
>>>
>>> The code in Fetcher.run(), around line 200 - sorry, my Fetcher has been
>>> a little modified, line numbers may vary a little - says, for that case:
>>>
>>> output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);
>>>
>>> What that does is insert an extra (empty) crawl datum for the new url,
>>> with a re-fetch interval of 0.0.
>>>
>>> However (see Generator.Selector.map(), particularly lines 144-145), the
>>> non-refetch condition used seems to be
>>> last-fetch + refetch-interval > now
>>> which is always false if refetch-interval == 0.0!
>>>
>>> Now, if there is a new link to the new url in that page, that crawl
>>> datum is re-used, and the whole thing loops indefinitely.
>>>
>>> I've fixed that for myself by replacing the quoted line (in both places)
>>> with:
>>>
>>> output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
>>> CrawlDatum.STATUS_LINKED);
>>>
>>> and that works (btw the 30f should really be the value of
>>> "db.default.fetch.interval", but I haven't had the time to work out the
>>> details). If I am right in analysing the algorithm, the default
>>> constructor and the appropriate updater method should always enforce a
>>> positive refetch interval.
>>>
>>> Of course, another method could be used to remove this self-reference,
>>> but that could be complicated, as it may happen through a loop (2 or
>>> more pages etc..., you know what I mean).
>>>
>>> Has that been fixed already, and by what method?
>>>
>>> Best regards
>>>
>>> George Herlin

--
DigitalPebble Ltd
http://www.digitalpebble.com
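For reference, the change George describes amounts to something like the sketch below, written against the Fetcher.run() call he quotes. It assumes the job Configuration is reachable at that point (e.g. via getConf()); the property lookup with a 30-day fallback is an illustration of how the hard-coded 30f could be replaced by "db.default.fetch.interval", not the exact Nutch 0.9 source.

    // Sketch: give the STATUS_LINKED datum a positive refetch interval.
    // The no-arg CrawlDatum() leaves the interval at 0.0, so the
    // Generator's test (lastFetch + interval > now) never holds and the
    // redirect target keeps being selected for fetch.
    //
    // Before:
    //   output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);
    //
    // After (assumed property name and default taken from this thread):
    float defaultInterval = getConf().getFloat("db.default.fetch.interval", 30f);
    output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, defaultInterval),
           null, null, CrawlDatum.STATUS_LINKED);

With a positive interval the datum written for the redirect target behaves like any other linked URL, so the Generator eventually stops re-selecting it and the loop shown in the log above cannot occur.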