Mehmet Tan wrote:
Sorry but I am not sure I could explain the problem properly.
What I am trying to ask is this:
You have pages A, B, C, D in the webdb, and then during the crawl you come
to a page E that redirects you to, say, page A. Then you create a new Page
object in the fetcher with url A and write this to the db (with updatedb).
This overwrites the page A already in the db, and you lose everything you
knew about page A.
In version 0.8, you (correct me if I am wrong) copy the old values so as
not to overwrite some fields. So I am trying to find out how to solve this
redirection problem in nutch-0.7, if we apply your adaptive refetch idea
to nutch-0.7.
Ah, ok, I get it now.
Well, first of all in 0.7 there was no metadata to worry about, so the
issue is simpler.
In 0.7, if you look at UpdateDatabaseTool, it clones the Page found in
fetcherOutput. This instance should be equal to the old instance (from the
older DB) plus any updates made during fetching. However, if this Page
comes from a redirect, then it contains wrong information (a newly
initialized score, see Fetcher:156), that's true. So UpdateDatabaseTool:256
should probably use webdb.addPageIfNotPresent(newPage) instead.
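The difference between blindly overwriting and adding only when absent can be sketched with a toy in-memory webdb (the Page class and the map-backed "db" below are simplified stand-ins for illustration, not the real Nutch types):

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectUpdateSketch {
    // Simplified stand-in for Nutch's Page: just a url and a score.
    static class Page {
        final String url;
        final float score;
        Page(String url, float score) { this.url = url; this.score = score; }
    }

    // Toy webdb keyed by url.
    static final Map<String, Page> webdb = new HashMap<>();

    // What the overwriting update effectively does: put() replaces the old entry.
    static void addPage(Page p) {
        webdb.put(p.url, p);
    }

    // Sketch of the proposed fix: keep the existing entry if one is already
    // present, so a redirect target cannot clobber the accumulated score.
    static void addPageIfNotPresent(Page p) {
        webdb.putIfAbsent(p.url, p);
    }

    public static void main(String[] args) {
        webdb.put("A", new Page("A", 2.5f));      // page A with its accumulated score
        Page fromRedirect = new Page("A", 1.0f);  // E redirected to A: fresh Page, default score

        addPageIfNotPresent(fromRedirect);
        // The old score survives:
        System.out.println(webdb.get("A").score); // prints 2.5
    }
}
```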
When it comes to 0.8, the situation is slightly different. First, there
is a bug in Fetcher so that currently it doesn't handle redirects based
on parsed content, and doesn't store this information in the segment. :/
So, no harm done yet, but purely by accident.
Then, in CrawlDbReducer (latest revision) we copy just the old metadata,
and all other information is taken from the new CrawlDatum. It's true,
however, that if you fetched the same page twice or more in a single
segment (or even in a single updatedb batch job), then some of the
entries will read SUCCESS but may contain incomplete data (e.g. no
metadata that was stored in CrawlDB and put on a fetchlist). Which one
gets picked depends on CrawlDatum.compareTo - probably the latest, which
may have come from a redirect. As we loop in CrawlDbReducer trying to
find the "highest" status value, there can be more than one value with
the same status (SUCCESS), and we are left with the last one.
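The loop's last-one-wins behaviour can be sketched as follows (the Datum class here is a bare stand-in carrying just a status and a metadata flag; the real CrawlDatum is much richer):

```java
import java.util.List;

public class StatusLoopSketch {
    static class Datum {
        final int status;        // higher value = "better" status
        final boolean hasMeta;   // did this entry carry the CrawlDB metadata?
        Datum(int status, boolean hasMeta) { this.status = status; this.hasMeta = hasMeta; }
    }

    static final int SUCCESS = 2;

    // Mimics the reducer loop: keep the entry with the highest status;
    // on a tie, the later entry silently replaces the earlier one.
    static Datum pick(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            if (best == null || d.status >= best.status) {
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Datum legitimate = new Datum(SUCCESS, true);    // fetchlist entry, metadata intact
        Datum fromRedirect = new Datum(SUCCESS, false); // same status, default (empty) metadata

        Datum picked = pick(List.of(legitimate, fromRedirect));
        System.out.println(picked.hasMeta); // prints false - the metadata is lost
    }
}
```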
So, the problem still exists, we could lose some data.
A way to solve this would be to introduce CrawlDatum.SUCCESS_REDIRECTED,
with a value lower than CrawlDatum.SUCCESS. By default, we probably
should skip them. Optionally, we could also accumulate in the result any
metadata from all CrawlDatum.SUCCESS* pages, but there is again danger
that some newly found pages will contain default metadata that
overwrites values coming from "legitimate" entries in a fetchlist.
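How a lower-valued SUCCESS_REDIRECTED would restore the ordering can be sketched like this (the constant names and numeric values are hypothetical, chosen only to show the idea, not taken from the Nutch source):

```java
public class RedirectStatusSketch {
    // Hypothetical status values: SUCCESS_REDIRECTED sorts below SUCCESS,
    // so a redirect entry can never outrank a legitimate fetch.
    static final int STATUS_SUCCESS_REDIRECTED = 4;
    static final int STATUS_SUCCESS = 5;

    // Pick the highest status, skipping redirect entries by default.
    static int pick(int[] statuses) {
        int best = Integer.MIN_VALUE;
        for (int s : statuses) {
            if (s == STATUS_SUCCESS_REDIRECTED) continue; // skipped by default
            if (s > best) best = s;
        }
        return best;
    }

    public static void main(String[] args) {
        int picked = pick(new int[] { STATUS_SUCCESS_REDIRECTED, STATUS_SUCCESS });
        System.out.println(picked == STATUS_SUCCESS); // prints true
    }
}
```

Optionally accumulating metadata from all SUCCESS* entries would then be a separate, explicit step, rather than an accident of iteration order.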
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-------------------------------------------------------
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general