Sorry, but I am not sure I explained the problem properly.
What I am trying to ask is this:
Suppose you have pages A, B, C, D in the webdb, and during the crawl you come
to a page E that redirects you to page A. The fetcher then creates a new
Page object with URL A and writes it to the db (with updatedb). This overwrites
the entry for page A already in the db, and you lose everything you knew about
page A.
In version 0.8 you (correct me if I am wrong) copy the old values so that some
fields are not overwritten. So I am trying to find out how to solve the above
redirection problem in nutch-0.7, if we apply your adaptive refetch idea to
nutch-0.7.
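
To make the question concrete, here is a minimal sketch of the "copy the old values" idea applied to a redirect target that is already in the db. It is not the nutch-0.7 code: PageEntry, RedirectMergeSketch and updateFromRedirect are hypothetical names, the db is just an in-memory map, and the choice of which fields to preserve (fetch interval and score) is only an assumption for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a webdb entry; NOT the real nutch-0.7 Page class.
final class PageEntry {
    final String url;
    final int fetchIntervalDays; // adaptive refetch state learned over time
    final float score;           // accumulated page score
    final String contentHash;    // signature of the most recently fetched content

    PageEntry(String url, int fetchIntervalDays, float score, String contentHash) {
        this.url = url;
        this.fetchIntervalDays = fetchIntervalDays;
        this.score = score;
        this.contentHash = contentHash;
    }
}

// Hypothetical in-memory stand-in for the webdb, keyed by URL.
final class RedirectMergeSketch {
    private final Map<String, PageEntry> db = new HashMap<>();

    void put(PageEntry entry) {
        db.put(entry.url, entry);
    }

    PageEntry get(String url) {
        return db.get(url);
    }

    // 'fresh' is the entry the fetcher built for a redirect target.
    void updateFromRedirect(PageEntry fresh) {
        PageEntry old = db.get(fresh.url);
        if (old == null) {
            // Target was unknown: store the fresh entry as-is.
            db.put(fresh.url, fresh);
            return;
        }
        // Target already known: keep the old interval and score (assumed to be
        // the fields worth preserving) and take only the new content hash.
        db.put(fresh.url, new PageEntry(
                old.url, old.fetchIntervalDays, old.score, fresh.contentHash));
    }

    public static void main(String[] args) {
        RedirectMergeSketch webdb = new RedirectMergeSketch();
        webdb.put(new PageEntry("http://example.com/A", 7, 2.5f, "hash-old"));
        // Page E redirected to A, so the fetcher produced a default entry for A.
        webdb.updateFromRedirect(new PageEntry("http://example.com/A", 30, 1.0f, "hash-new"));
        PageEntry a = webdb.get("http://example.com/A");
        System.out.println(a.fetchIntervalDays + " " + a.score + " " + a.contentHash);
    }
}
```

In this sketch the history of page A survives the redirect from E, and only the content hash is refreshed.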
Thanks..
Mehmet
Andrzej Bialecki wrote:
Mehmet Tan wrote:
Andrzej,
Thanks for your response and patch. But I have a few more questions about
adaptive refetch. As far as I understood, the solution below is 'not to
overwrite some fields of the entries' in the db. Assume we applied the adaptive
refetch idea in your patch to the 0.7 version. We have the same
redirection problem there too.
What do you think is the best way to solve this problem there in
version 0.7?
Well, you refer to two different problems:
* there was a problem in CrawlDbReducer that (possibly) new values of
fetchInterval and fetchTime were not applied correctly to the
CrawlDatum to be stored in the DB. The patch contained a fix ONLY for
this issue.
* redirection problem: I'm not sure what the solution should be; IMHO
it's a matter of properly setting URLFilters. If you don't allow
certain patterns, you should not collect such urls, no matter whether they
come from redirection or directly from the outlinks. If you make an
exception for such urls, the next time you generate a fetchlist or
run updatedb these urls will be filtered out anyway.
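
The filtering point above can be sketched as follows: one set of deny patterns is applied to every candidate url, whether it came from an outlink or from a redirect. This is only an illustration with hypothetical names (UrlFilterSketch, accept, collect); it does not use the actual Nutch URLFilter plugin interface, and the example pattern is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter; NOT the Nutch URLFilter plugin interface.
final class UrlFilterSketch {
    private final List<Pattern> deny = new ArrayList<>();

    UrlFilterSketch(String... denyRegexes) {
        for (String regex : denyRegexes) {
            deny.add(Pattern.compile(regex));
        }
    }

    // A url is accepted only if no deny pattern matches any part of it.
    boolean accept(String url) {
        for (Pattern p : deny) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }

    // Outlinks and redirect targets go through the same check, so a
    // disallowed url is never collected regardless of how it was found.
    List<String> collect(List<String> outlinks, List<String> redirectTargets) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (accept(url)) kept.add(url);
        }
        for (String url : redirectTargets) {
            if (accept(url)) kept.add(url);
        }
        return kept;
    }
}
```

With a deny pattern such as `\.pdf$`, a PDF url is dropped whether it appears as an outlink or as the target of a redirect.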