Mehmet Tan wrote:

Sorry, I am not sure I explained the problem properly.
What I am trying to ask is this:
You have pages A, B, C, D in the webdb, and then during the crawl you come
to a page E that redirects you to page A, for example. Then you create a
new Page object in the fetcher with URL A and write this to the db (with
updatedb). This overwrites the page A already in the db, and you lose
everything you knew about page A.

In version 0.8, you (correct me if I am wrong) copy over the old values so as
not to overwrite some fields. So I am trying to find out how to solve the above
redirection problem in nutch-0.7, if we apply your adaptive refetch idea to
nutch-0.7.
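To make the failure mode concrete, here is a toy model of the overwrite (this is not Nutch code; the Page fields, the map standing in for the webdb, and the class name are all invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the 0.7 redirect overwrite problem (invented names).
public class RedirectOverwriteDemo {
    static class Page {
        float score;
        int fetchCount;   // stand-in for accumulated history about the page
        Page(float score, int fetchCount) { this.score = score; this.fetchCount = fetchCount; }
    }

    public static void main(String[] args) {
        Map<String, Page> webdb = new HashMap<>();
        webdb.put("A", new Page(2.5f, 7));   // page A with real history

        // E redirects to A: the fetcher builds a fresh Page for A with a
        // newly initialized score, and updatedb writes it unconditionally.
        webdb.put("A", new Page(1.0f, 0));

        System.out.println(webdb.get("A").fetchCount); // 0 -- history lost
    }
}
```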

Ah, ok, I get it now.

Well, first of all in 0.7 there was no metadata to worry about, so the issue is simpler.

In 0.7, if you look at UpdateDatabaseTool, it clones the Page found in fetcherOutput. This instance should equal the old instance (from the older DB) plus any updates made during fetching. However, if this Page comes from a redirect, it will contain wrong information (a newly initialized score, see Fetcher:156), that's true. So UpdateDatabaseTool:256 should probably use webdb.addPageIfNotPresent(newPage) instead.
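The intended semantics of addPageIfNotPresent can be sketched with a plain map standing in for the 0.7 webdb (again a toy model, not the actual WebDBWriter; the Page fields are invented):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of addPageIfNotPresent semantics (invented names, not Nutch code).
public class AddIfNotPresentDemo {
    static class Page {
        float score;
        Page(float score) { this.score = score; }
    }

    static void addPageIfNotPresent(Map<String, Page> webdb, String url, Page p) {
        // insert only when the URL is genuinely new; an existing entry,
        // with its real score and history, is left untouched
        webdb.putIfAbsent(url, p);
    }

    public static void main(String[] args) {
        Map<String, Page> webdb = new HashMap<>();
        webdb.put("A", new Page(2.5f));

        addPageIfNotPresent(webdb, "A", new Page(1.0f)); // redirect target, already known
        addPageIfNotPresent(webdb, "E", new Page(1.0f)); // genuinely new page

        System.out.println(webdb.get("A").score);   // 2.5 -- old data preserved
        System.out.println(webdb.containsKey("E")); // true -- new page added
    }
}
```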

When it comes to 0.8, the situation is slightly different. First, there is a bug in the Fetcher: currently it doesn't handle redirects based on parsed content, and doesn't store this information in the segment. :/ So no harm done yet, but purely by accident.

Then, in CrawlDbReducer (latest revision) we copy just the old metadata; all other information is taken from the new CrawlDatum. It's true, however, that if you fetched the same page twice or more in a single segment (or even in a single updatedb batch job), some of the entries will read SUCCESS but may contain incomplete data (e.g. no metadata that was stored in the CrawlDB and put on a fetchlist). Which one gets picked depends on CrawlDatum.compareTo, probably the latest (which may have come from a redirect). As we loop in CrawlDbReducer, trying to find the "highest" status value, there can be more than one value with the same status (SUCCESS), and we will be left with the last one.
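The tie-breaking behavior can be sketched like this (a simplified model of the status-selection loop, not the real CrawlDbReducer; the Datum class, status values, and metadata key are all invented):

```java
import java.util.*;

// Toy model of the CrawlDbReducer status-selection loop (invented names):
// among competing datums for one URL, keep the highest status; on ties,
// the later datum wins.
public class CrawlDbReduceSketch {
    static final int STATUS_SUCCESS = 5;

    static class Datum {
        int status;
        Map<String, String> meta = new HashMap<>();
        Datum(int status) { this.status = status; }
    }

    static Datum reduce(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            // ">=" means a later datum with an equal status replaces an
            // earlier one -- this is where a redirect-produced SUCCESS
            // can shadow the legitimate fetch
            if (best == null || d.status >= best.status) best = d;
        }
        return best;
    }

    public static void main(String[] args) {
        Datum legit = new Datum(STATUS_SUCCESS);
        legit.meta.put("retryInterval", "30d");      // metadata from the fetchlist
        Datum fromRedirect = new Datum(STATUS_SUCCESS); // same status, no metadata

        Datum picked = reduce(Arrays.asList(legit, fromRedirect));
        System.out.println(picked.meta.containsKey("retryInterval")); // false -- lost
    }
}
```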

So, the problem still exists, we could lose some data.

A way to solve this would be to introduce CrawlDatum.SUCCESS_REDIRECTED, with a value lower than CrawlDatum.SUCCESS. By default, we should probably skip such entries. Optionally, we could also accumulate in the result any metadata from all CrawlDatum.SUCCESS* pages, but there is again a danger that some newly found pages will contain default metadata that overwrites values coming from "legitimate" entries in the fetchlist.
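A quick sketch of why a lower value fixes the tie (SUCCESS_REDIRECTED does not exist in Nutch yet; the numeric values and names here are hypothetical):

```java
import java.util.*;

// Sketch of the proposed fix: a hypothetical SUCCESS_REDIRECTED status
// strictly below SUCCESS, so the reducer's "highest status wins" loop
// prefers the legitimate entry even when the redirect arrives later.
public class SuccessRedirectedSketch {
    static final int STATUS_SUCCESS_REDIRECTED = 4; // invented value
    static final int STATUS_SUCCESS = 5;            // invented value

    static class Datum {
        int status;
        Map<String, String> meta = new HashMap<>();
        Datum(int status) { this.status = status; }
    }

    static Datum reduce(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            // same tie rule as before: later datum wins on equal status,
            // but the redirect now has a strictly lower status and loses
            if (best == null || d.status >= best.status) best = d;
        }
        return best;
    }

    public static void main(String[] args) {
        Datum legit = new Datum(STATUS_SUCCESS);
        legit.meta.put("retryInterval", "30d");
        Datum redir = new Datum(STATUS_SUCCESS_REDIRECTED); // arrives later

        Datum picked = reduce(Arrays.asList(legit, redir));
        System.out.println(picked.meta.get("retryInterval")); // 30d -- kept
    }
}
```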

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
