Mehmet Tan wrote:
Andrzej,
Thanks for your response and patch. But I have a few more questions
about adaptive refetch. As far as I understood, the solution below is
'not to overwrite some fields of the entries' in the db. Assume we
applied the adaptive refetch idea in your patch to the 0.7 version. We
have the same redirection problem there too.
What do you think is the best way to solve this problem there in
version 0.7?

Well, you refer to two different problems:
* there was a problem in CrawlDbReducer: (possibly) new values of
fetchInterval and fetchTime were not applied correctly to the CrawlDatum
to be stored in the DB. The patch contained a fix ONLY for this issue.
* redirection problem: I'm not sure what the solution should be. IMHO
it's a matter of properly setting URLFilters. If you don't allow certain
patterns, you should not collect such URLs, no matter whether they come
from redirection or directly from the outlinks. If you make an exception
for such URLs, they will be filtered out anyway the next time you
generate a fetchlist or run updatedb.
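
To illustrate the second point, here is a minimal, self-contained sketch
of the idea that the same filter should be applied to every candidate URL,
whether it comes from an outlink or from a redirect. This is not Nutch
code: the class UrlFilterSketch, the deny patterns and the collect()
helper are all hypothetical, and the example.com URLs are made up. The
only thing borrowed is the convention that a filter returns null for a
rejected URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the Nutch URLFilter API.
class UrlFilterSketch {

    // Assumed deny patterns, for illustration only.
    private static final List<Pattern> DENY = Arrays.asList(
            Pattern.compile("^https?://[^/]+/session/.*"),
            Pattern.compile(".*\\.(gif|jpg|png)$"));

    // Returns the URL unchanged if it is accepted, or null if it is rejected.
    static String filter(String url) {
        for (Pattern p : DENY) {
            if (p.matcher(url).matches()) {
                return null;
            }
        }
        return url;
    }

    // Applies the same filter to outlinks and redirect targets alike,
    // so a redirect cannot smuggle in a URL that an outlink could not.
    static List<String> collect(List<String> outlinks, List<String> redirectTargets) {
        List<String> accepted = new ArrayList<>();
        for (String url : outlinks) {
            if (filter(url) != null) {
                accepted.add(url);
            }
        }
        for (String url : redirectTargets) {
            if (filter(url) != null) {
                accepted.add(url);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.com/a.html", "http://example.com/logo.gif");
        List<String> redirects = Arrays.asList(
                "http://example.com/session/xyz", "http://example.com/b.html");
        System.out.println(collect(outlinks, redirects));
    }
}
```

With filtering done at collection time, there is no special case for
redirects: a rejected URL never reaches the DB, so nothing has to be
cleaned up later at generate or updatedb time.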
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general