Mehmet Tan wrote:

Sorry, I am not sure I explained the problem properly.
What I am trying to ask is this:
You have pages A, B, C, D in the webdb, and then during the crawl you come
to a page E that redirects you to page A, for example. Then you create a
new Page object in the fetcher with URL A and write this to the db (with
updatedb). This overwrites the page A already in the db, and you lose
everything you knew about page A.

In version 0.8, you (correct me if I am wrong) copy over the old values so as
not to overwrite some fields. So I am trying to find out how to solve the above
redirection problem in nutch-0.7, if we apply your adaptive refetch idea to
nutch-0.7.
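To make the failure mode concrete, here is a toy model of the overwrite (this is not Nutch code; the Page fields, the map standing in for the webdb, and the class name are all invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the 0.7 redirect overwrite problem (invented names).
public class RedirectOverwriteDemo {
    static class Page {
        float score;
        int fetchCount;   // stand-in for accumulated history about the page
        Page(float score, int fetchCount) { this.score = score; this.fetchCount = fetchCount; }
    }

    public static void main(String[] args) {
        Map<String, Page> webdb = new HashMap<>();
        webdb.put("A", new Page(2.5f, 7));   // page A with real history

        // E redirects to A: the fetcher builds a fresh Page for A with a
        // newly initialized score, and updatedb writes it unconditionally.
        webdb.put("A", new Page(1.0f, 0));

        System.out.println(webdb.get("A").fetchCount); // 0 -- history lost
    }
}
```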

Ah, ok, I get it now.

Well, first of all in 0.7 there was no metadata to worry about, so the issue is simpler.

In 0.7, if you look at UpdateDatabaseTool, it clones the Page found in fetcherOutput. This instance should equal the old instance (from the older DB) plus any updates made during fetching. However, if this Page comes from a redirect, it will contain wrong information (a newly initialized score, see Fetcher:156), that's true. So UpdateDatabaseTool:256 should probably use webdb.addPageIfNotPresent(newPage) instead.
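The intended semantics of addPageIfNotPresent can be sketched with a plain map standing in for the 0.7 webdb (again a toy model, not the actual WebDBWriter; the Page fields are invented):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of addPageIfNotPresent semantics (invented names, not Nutch code).
public class AddIfNotPresentDemo {
    static class Page {
        float score;
        Page(float score) { this.score = score; }
    }

    static void addPageIfNotPresent(Map<String, Page> webdb, String url, Page p) {
        // insert only when the URL is genuinely new; an existing entry,
        // with its real score and history, is left untouched
        webdb.putIfAbsent(url, p);
    }

    public static void main(String[] args) {
        Map<String, Page> webdb = new HashMap<>();
        webdb.put("A", new Page(2.5f));

        addPageIfNotPresent(webdb, "A", new Page(1.0f)); // redirect target, already known
        addPageIfNotPresent(webdb, "E", new Page(1.0f)); // genuinely new page

        System.out.println(webdb.get("A").score);   // 2.5 -- old data preserved
        System.out.println(webdb.containsKey("E")); // true -- new page added
    }
}
```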

When it comes to 0.8, the situation is slightly different. First, there is a bug in the Fetcher: currently it doesn't handle redirects based on parsed content, and doesn't store this information in the segment. :/ So no harm done yet, but purely by accident.

Then, in CrawlDbReducer (latest revision) we copy just the old metadata; all other information is taken from the new CrawlDatum. It's true, however, that if you fetched the same page twice or more in a single segment (or even in a single updatedb batch job), some of the entries will read SUCCESS but may contain incomplete data (e.g. no metadata that was stored in the CrawlDB and put on a fetchlist). Which one gets picked depends on CrawlDatum.compareTo, probably the latest (which may have come from a redirect). As we loop in CrawlDbReducer, trying to find the "highest" status value, there can be more than one value with the same status (SUCCESS), and we will be left with the last one.
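The tie-breaking behavior can be sketched like this (a simplified model of the status-selection loop, not the real CrawlDbReducer; the Datum class, status values, and metadata key are all invented):

```java
import java.util.*;

// Toy model of the CrawlDbReducer status-selection loop (invented names):
// among competing datums for one URL, keep the highest status; on ties,
// the later datum wins.
public class CrawlDbReduceSketch {
    static final int STATUS_SUCCESS = 5;

    static class Datum {
        int status;
        Map<String, String> meta = new HashMap<>();
        Datum(int status) { this.status = status; }
    }

    static Datum reduce(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            // ">=" means a later datum with an equal status replaces an
            // earlier one -- this is where a redirect-produced SUCCESS
            // can shadow the legitimate fetch
            if (best == null || d.status >= best.status) best = d;
        }
        return best;
    }

    public static void main(String[] args) {
        Datum legit = new Datum(STATUS_SUCCESS);
        legit.meta.put("retryInterval", "30d");      // metadata from the fetchlist
        Datum fromRedirect = new Datum(STATUS_SUCCESS); // same status, no metadata

        Datum picked = reduce(Arrays.asList(legit, fromRedirect));
        System.out.println(picked.meta.containsKey("retryInterval")); // false -- lost
    }
}
```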

So, the problem still exists, we could lose some data.

A way to solve this would be to introduce CrawlDatum.SUCCESS_REDIRECTED, with a value lower than CrawlDatum.SUCCESS. By default, we should probably skip such entries. Optionally, we could also accumulate in the result any metadata from all CrawlDatum.SUCCESS* pages, but there is again a danger that some newly found pages will contain default metadata that overwrites values coming from "legitimate" entries in the fetchlist.
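A quick sketch of why a lower value fixes the tie (SUCCESS_REDIRECTED does not exist in Nutch yet; the numeric values and names here are hypothetical):

```java
import java.util.*;

// Sketch of the proposed fix: a hypothetical SUCCESS_REDIRECTED status
// strictly below SUCCESS, so the reducer's "highest status wins" loop
// prefers the legitimate entry even when the redirect arrives later.
public class SuccessRedirectedSketch {
    static final int STATUS_SUCCESS_REDIRECTED = 4; // invented value
    static final int STATUS_SUCCESS = 5;            // invented value

    static class Datum {
        int status;
        Map<String, String> meta = new HashMap<>();
        Datum(int status) { this.status = status; }
    }

    static Datum reduce(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            // same tie rule as before: later datum wins on equal status,
            // but the redirect now has a strictly lower status and loses
            if (best == null || d.status >= best.status) best = d;
        }
        return best;
    }

    public static void main(String[] args) {
        Datum legit = new Datum(STATUS_SUCCESS);
        legit.meta.put("retryInterval", "30d");
        Datum redir = new Datum(STATUS_SUCCESS_REDIRECTED); // arrives later

        Datum picked = reduce(Arrays.asList(legit, redir));
        System.out.println(picked.meta.get("retryInterval")); // 30d -- kept
    }
}
```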

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
