Mehmet Tan wrote:
What I am trying to ask is this:
You have pages A, B, C, D in the webdb, and during the crawl you come
to a page E that redirects you to page A, for example. Then you create
a new Page object in the fetcher with url A and write it to the db
(with updatedb). This overwrites the page A already in the db, and you
lose everything you knew about page A.

Redirects are mostly invisible to Nutch. In the case you describe, the content fetched for URL E (which redirects to A) would be the same as the content for A, but the two URLs would have separate entries in the CrawlDB, the link graph, etc., so the entry for A is not overwritten. We do store the final URL of a redirect chain so that we can resolve relative references in the page, but it is not used as the URL for the content. The content is always associated with the first URL in the redirect chain.
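
To make the keying rule concrete, here is a minimal sketch in Python. It is not Nutch code: the dict standing in for the database, the function name store_fetched_page, the field names, and the example.com URLs are all assumptions made up for illustration. It only shows the idea described above, that content is stored under the first URL of the redirect chain while the final URL is kept as the base for resolving relative links.

```python
from urllib.parse import urljoin


def store_fetched_page(db, redirect_chain, content, relative_links):
    """Store a fetch result keyed by the FIRST URL in the redirect chain.

    db is a plain dict used as a hypothetical stand-in for the crawl database.
    """
    first_url = redirect_chain[0]   # URL that was originally requested
    final_url = redirect_chain[-1]  # URL that actually served the content
    db[first_url] = {
        "content": content,
        # The final URL is kept only as the base for relative references.
        "base_url": final_url,
        "outlinks": [urljoin(final_url, link) for link in relative_links],
    }
    return db


# Hypothetical URLs: E redirects to A, and A is already known.
A = "http://example.com/a/index.html"
E = "http://example.com/e"

db = {A: {"content": "old A", "base_url": A, "outlinks": []}}
store_fetched_page(db, [E, A], "body of A", ["next.html"])
```

In this sketch the existing record for A is left untouched, a separate record appears under E, and the relative link next.html is resolved against A rather than E.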

Doug


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
