Doug Cutting wrote:
Mehmet Tan wrote:
What I am trying to ask is this:
You have pages A,B,C,D in webdb and then you come
to a page E during the crawl and page E redirects you to page
A for example. Then you create a new Page object in the fetcher
with url A and write this to db (with updatedb). This overwrites page A
already in db, and you lose everything you knew about page A.
Redirects are mostly invisible to Nutch. In the case you describe,
the content of url E (which redirects to A) would be the same as the
content for A, but these would have separate entries in the CrawlDB,
link-graph, etc. We do store the final url in a redirect chain so
that we can resolve relative references in the page, but that is not
used as the url for the content. The content is always associated
with the first url in the redirect chain.
The problem was not conceptual, but in the implementation of
CrawlDbReducer, where new "synthetic" CrawlDatum A' (created in response
to a redirect) could overwrite CrawlDatum A coming from a legitimate
entry in the fetchlist. CrawlDatum A could contain metadata coming from
previous fetches, which would be absent in CrawlDatum A'; in the end
CrawlDatum A' would probably be picked as the final version to be
committed to the DB, resulting in data loss.
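
To make the failure mode concrete, here is a minimal sketch. It is not the
actual CrawlDbReducer code: the Datum class, its "synthetic" flag, the
"fetchCount" metadata key and both merge methods are hypothetical stand-ins,
assumed only for illustration. It shows how a "last value wins" merge can
drop the metadata of A, and how preferring the legitimate entry avoids that.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RedirectMergeSketch {

    /** Simplified stand-in for a CrawlDatum; all fields are hypothetical. */
    static final class Datum {
        final String url;
        final boolean synthetic;            // true if created in response to a redirect
        final Map<String, String> metadata; // data accumulated over previous fetches

        Datum(String url, boolean synthetic, Map<String, String> metadata) {
            this.url = url;
            this.synthetic = synthetic;
            this.metadata = new HashMap<>(metadata);
        }
    }

    /** Naive merge: the last value seen wins, so a synthetic A' can replace A. */
    static Datum mergeLastWins(List<Datum> values) {
        Datum result = null;
        for (Datum d : values) {
            result = d;
        }
        return result;
    }

    /** Safer merge: a synthetic datum never replaces a legitimate one. */
    static Datum mergePreferLegitimate(List<Datum> values) {
        Datum result = null;
        for (Datum d : values) {
            // Take the first value, or upgrade from a synthetic to a legitimate one.
            if (result == null || (result.synthetic && !d.synthetic)) {
                result = d;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Datum a = new Datum("A", false, Map.of("fetchCount", "3")); // from the fetchlist
        Datum aPrime = new Datum("A", true, Map.of());              // created by the redirect E -> A
        List<Datum> values = List.of(a, aPrime);

        System.out.println(mergeLastWins(values).metadata);         // prints {}
        System.out.println(mergePreferLegitimate(values).metadata); // prints {fetchCount=3}
    }
}
```

The sketch keeps the first legitimate datum it sees; a real fix might instead
need to copy selected fields between A and A', which is not shown here.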
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general