Sorry, I am not sure I explained the problem properly.
What I am trying to ask is this:
You have pages A, B, C, D in the webdb, and during the crawl you come
to a page E that redirects you to page A, for example. The fetcher
then creates a new Page object with url A and writes it to the db
(with updatedb). This overwrites the page A already in the db, and
you lose everything you knew about page A.

In version 0.8, you (correct me if I am wrong) copy the old values so that
some fields are not overwritten. So I am trying to find out how to solve the
above redirection problem in nutch-0.7, if we apply your adaptive refetch
idea to nutch-0.7.
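
To make the question concrete, here is a minimal sketch of the "copy the old values" idea in plain Java. It is only an illustration under assumptions: PageRecord, RedirectMergeSketch and their fields are hypothetical names, not the nutch-0.7 Page class or the 0.8 CrawlDbReducer, and the in-memory map merely stands in for the webdb.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a db entry; NOT the nutch-0.7 Page class. */
final class PageRecord {
    final String url;
    final int fetchIntervalDays; // adaptive value learned over earlier fetches
    final float score;
    final long lastFetchTime;    // epoch millis

    PageRecord(String url, int fetchIntervalDays, float score, long lastFetchTime) {
        this.url = url;
        this.fetchIntervalDays = fetchIntervalDays;
        this.score = score;
        this.lastFetchTime = lastFetchTime;
    }
}

final class RedirectMergeSketch {
    /**
     * Keeps the accumulated fields (interval, score) of the existing entry
     * and takes only the newer fetch time from the freshly created one.
     */
    static PageRecord merge(PageRecord existing, PageRecord fresh) {
        return new PageRecord(
                existing.url,
                existing.fetchIntervalDays,
                existing.score,
                Math.max(existing.lastFetchTime, fresh.lastFetchTime));
    }

    /** Inserts the entry if the url is unknown, otherwise merges instead of overwriting. */
    static void update(Map<String, PageRecord> db, PageRecord fresh) {
        db.merge(fresh.url, fresh, RedirectMergeSketch::merge);
    }

    public static void main(String[] args) {
        Map<String, PageRecord> db = new HashMap<>();
        // Page A is already known, with a learned 7-day interval.
        db.put("http://example.com/A", new PageRecord("http://example.com/A", 7, 2.5f, 1000L));

        // Page E redirected to A, so a fresh entry for A arrives with default values.
        update(db, new PageRecord("http://example.com/A", 30, 1.0f, 2000L));

        PageRecord a = db.get("http://example.com/A");
        System.out.println(a.fetchIntervalDays + " " + a.score + " " + a.lastFetchTime);
    }
}
```

Which fields should survive the merge is a design choice; the sketch keeps the interval and score and refreshes only the fetch time.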

Thanks..

Mehmet

Andrzej Bialecki wrote:

Mehmet Tan wrote:


  Andrzej,
Thanks for your response and patch. But I have a few more questions about adaptive refetch. As far as I understand, the solution below is 'not to overwrite some fields of the entries' in the db. Assume we applied the adaptive refetch idea in your patch to the 0.7 version. We have the same redirection problem there too. What do you think is the best way to solve this problem in version 0.7?


Well, you refer to two different problems:

* there was a problem in CrawlDbReducer that (possibly) new values of fetchInterval and fetchTime were not applied correctly to the CrawlDatum to be stored in the DB. The patch contained a fix ONLY for this issue.

* redirection problem: I'm not sure what the solution should be; IMHO it's a matter of properly setting URLFilters. If you don't allow certain patterns, you should not collect such urls, whether they come from redirection or directly from the outlinks. If you make an exception for such urls, the next time you generate a fetchlist or run updatedb these urls will be filtered out anyway.
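
The second point can be sketched as a single filter applied to every candidate url, whatever its source. This is a standalone illustration, not the Nutch URLFilter plugin code: UrlGateSketch, its methods, and the deny pattern used in main are all hypothetical, and only the convention of returning null for a rejected url is borrowed from the filters discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical filter; one set of deny patterns for all url sources. */
final class UrlGateSketch {
    private final List<Pattern> denied = new ArrayList<>();

    UrlGateSketch(String... deniedRegexes) {
        for (String regex : deniedRegexes) {
            denied.add(Pattern.compile(regex));
        }
    }

    /** Returns the url if accepted, or null if any deny pattern matches. */
    String filter(String url) {
        for (Pattern p : denied) {
            if (p.matcher(url).find()) {
                return null;
            }
        }
        return url;
    }

    /** Applies the same filter to outlinks and redirect targets alike. */
    List<String> accept(List<String> outlinks, List<String> redirectTargets) {
        List<String> accepted = new ArrayList<>();
        List<String> candidates = new ArrayList<>(outlinks);
        candidates.addAll(redirectTargets);
        for (String url : candidates) {
            String kept = filter(url);
            if (kept != null) {
                accepted.add(kept);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Assumed example pattern: reject urls carrying a session id.
        UrlGateSketch gate = new UrlGateSketch("[?&]sessionid=");
        List<String> kept = gate.accept(
                List.of("http://example.com/B", "http://example.com/C?sessionid=1"),
                List.of("http://example.com/A", "http://example.com/A?sessionid=2"));
        System.out.println(kept);
    }
}
```

Because redirect targets go through the same check as outlinks, a url that the patterns reject never reaches the db by either route.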




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
