[Nutch-general] Re: Adaptive Refetch

Doug Cutting Wed, 05 Apr 2006 14:31:12 -0700

Mehmet Tan wrote:

What I am trying to ask is this:
You have pages A,B,C,D in webdb and then you come
to a page E during the crawl and page E redirects you to page
A for example. Then you create a new Page object in the fetcher
with url A and write this to db (with updatedb). This overwrites page A
already in db, and you lose everything you knew about page A.

Redirects are mostly invisible to Nutch. In the case you describe, the content of url E (which redirects to A) would be the same as the content for A, but these would have separate entries in the CrawlDB, link-graph, etc. We do store the final url in a redirect chain so that we can resolve relative references in the page, but that is not used as the url for the content. The content is always associated with the first url in the redirect chain.


Doug
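
To make the behavior described above concrete, here is a minimal sketch in Python. It is not Nutch code: the names FetchedPage, fetch, crawl_db, redirects and bodies, and the example.com URLs, are all hypothetical and exist only for illustration. It assumes a toy "web" held in two dictionaries, and shows content being keyed by the first url in the redirect chain while the final url is kept only as the base for resolving relative references.

```python
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass
class FetchedPage:
    url: str       # first url in the redirect chain; the key for this content
    base_url: str  # final url in the chain; used only to resolve relative links
    content: str


def fetch(url, redirects, bodies, max_redirects=5):
    """Follow redirects starting at `url`, but keep `url` as the page's key."""
    current = url
    hops = 0
    while current in redirects:
        if hops == max_redirects:
            raise RuntimeError("too many redirects from " + url)
        current = redirects[current]
        hops += 1
    return FetchedPage(url=url, base_url=current, content=bodies[current])


# Hypothetical toy web: E redirects to A, so both urls yield the same content.
URL_A = "http://example.com/dir/A"
URL_E = "http://example.com/E"
redirects = {URL_E: URL_A}
bodies = {URL_A: "<a href='img.png'>picture</a>"}

# Stand-in for the db: one entry per requested url, never per redirect target.
crawl_db = {}
for start in (URL_A, URL_E):
    page = fetch(start, redirects, bodies)
    crawl_db[page.url] = page

# Relative references in E's content resolve against the final url (A).
resolved = urljoin(crawl_db[URL_E].base_url, "img.png")
```

In this sketch, fetching E adds an entry under E and leaves the existing entry for A untouched, which is the situation described in the reply: same content, separate entries.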


