Doug Cutting wrote:
Mehmet Tan wrote:
What I am trying to ask is this:
You have pages A, B, C and D in the webdb, and then during the crawl
you come to a page E that redirects you to page A, for example.
The fetcher then creates a new Page object with url A and writes it
to the db (with updatedb). This overwrites the page A that is already
in the db, and you lose everything you knew about page A.

Redirects are mostly invisible to Nutch. In the case you describe, the content of url E (which redirects to A) would be the same as the content for A, but these would have separate entries in the CrawlDB, link-graph, etc. We do store the final url in a redirect chain so that we can resolve relative references in the page, but that is not used as the url for the content. The content is always associated with the first url in the redirect chain.
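
To make the distinction between the two urls concrete, here is a minimal sketch of the rule described above. It is not Nutch code: the class RedirectChainSketch and its methods contentKey and resolveLink are hypothetical names invented for illustration, and the redirect chain is assumed to be a simple list of url strings ordered from first to last.

```java
import java.net.URI;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Nutch.
final class RedirectChainSketch {

    // Content is filed under the FIRST url in the redirect chain (E, not A).
    static String contentKey(List<String> chain) {
        return chain.get(0);
    }

    // Relative references in the page are resolved against the FINAL url.
    static String resolveLink(List<String> chain, String relativeLink) {
        URI base = URI.create(chain.get(chain.size() - 1));
        return base.resolve(relativeLink).toString();
    }
}
```

With a chain of E followed by A, contentKey returns E, while a relative link found in the page is resolved against A.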

The problem was not conceptual but in the implementation of CrawlDbReducer, where a new "synthetic" CrawlDatum A' (created in response to a redirect) could overwrite CrawlDatum A coming from a legitimate entry in the fetchlist. CrawlDatum A could contain metadata coming from previous fetches, which would be absent in CrawlDatum A', but in the end CrawlDatum A' would probably be picked as the final version to be committed to the DB, resulting in data loss.
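
The failure mode can be sketched roughly as follows. This is a simplified model, not the actual CrawlDbReducer code and not necessarily the fix that was applied: the Datum class, its synthetic flag, and the methods lastWins and preferReal are all hypothetical names, and metadata is assumed to be a plain string map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the overwrite problem; not the real CrawlDatum or reducer.
final class RedirectMergeSketch {

    static final class Datum {
        final String url;
        final boolean synthetic;            // true if created in response to a redirect
        final Map<String, String> metadata; // data accumulated from previous fetches

        Datum(String url, boolean synthetic, Map<String, String> metadata) {
            this.url = url;
            this.synthetic = synthetic;
            this.metadata = new HashMap<>(metadata);
        }
    }

    // Problematic behaviour: the incoming datum always replaces the existing
    // one, so a synthetic A' wipes out the metadata held by A.
    static Datum lastWins(Datum existing, Datum incoming) {
        return incoming;
    }

    // One possible guard (an assumption, not the committed fix): a synthetic
    // datum never replaces a datum that came from a real fetchlist entry.
    static Datum preferReal(Datum existing, Datum incoming) {
        if (existing == null) {
            return incoming;
        }
        if (incoming.synthetic && !existing.synthetic) {
            return existing;
        }
        return incoming;
    }
}
```

Merging A (with metadata) and A' (synthetic, empty) through lastWins yields a datum with no metadata, whereas preferReal keeps A and its history.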

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
