Doug Cutting wrote:
> The OPIC algorithm is not really designed for re-fetching. It assumes that each link is seen only once. When pages

Ummm, well, this is definitely not our case.

> are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links in the original version of a page. This is not perfect, but considerably better

But then we would miss any new links from that page. I don't think that's acceptable. Think, e.g., of news sites, where the links on a given page change on a daily or even hourly basis.

> than what happens now. Incrementally updating the score would require re-processing the parser outputs to find outlinks from the previous version of the page and then subtracting their contribution from the page's score. This is possible, but not easy.
If you remember, some time ago I proposed a different solution: to involve linkDB in score calculations, and to store these partial OPIC score values in Inlink. This would allow us to track score contributions per source/target pair. Newly discovered links would get the initial partial score value from the originating page, and we could track these values if the original page's score changes (e.g. the number of links increases, or the page's score is updated).
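
To make the per-link idea concrete, here is a minimal sketch. It is plain Java, not Nutch code: the class LinkScoreLedger and its methods are hypothetical names, and the equal split of a page's score among its outlinks is a simplifying assumption about how the partial values would be computed. The point is only that keeping one partial score per source/target pair lets a refetch replace the old contributions instead of adding to them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: partial scores stored per source/target pair,
// the way they might be attached to inlink entries.
class LinkScoreLedger {
    // target URL -> (source URL -> partial score contributed by that source)
    private final Map<String, Map<String, Float>> contributions = new HashMap<>();

    // Records the outlinks seen on the latest fetch of `source`.
    // Earlier contributions from the same source are dropped first,
    // so fetching a page again never counts a link twice.
    void recordFetch(String source, float sourceScore, Set<String> outlinks) {
        for (Map<String, Float> bySource : contributions.values()) {
            bySource.remove(source);
        }
        if (outlinks.isEmpty()) {
            return;
        }
        // Assumption: the source's score is split equally among its outlinks.
        float share = sourceScore / outlinks.size();
        for (String target : outlinks) {
            contributions.computeIfAbsent(target, k -> new HashMap<>()).put(source, share);
        }
    }

    // Score of a target = sum of the partial scores of its current inlinks.
    float score(String target) {
        Map<String, Float> bySource = contributions.get(target);
        if (bySource == null) {
            return 0f;
        }
        float sum = 0f;
        for (float part : bySource.values()) {
            sum += part;
        }
        return sum;
    }
}
```

With this layout, a link that disappears from a page stops contributing, a new link starts contributing, and an unchanged link is counted once no matter how often the page is refetched. The linear scan that removes stale entries is only there to keep the sketch short; a real implementation would presumably index contributions by source as well.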
BTW: I've been toying with some patches to implement pluggable scoring mechanisms; it would be easy to provide hooks for custom scoring implementations. Scores are just float values, so they would be sufficient for a wide range of scoring mechanisms; for others, the newly added CrawlDatum.metadata could be used.
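
Roughly what I have in mind for the hooks, as a sketch only. The interface ScoringHook and both implementations below are hypothetical names invented for illustration, not an existing Nutch API and not the contents of the patches; the three methods are just one plausible way to cut the problem.

```java
// Hypothetical plugin contract: everything a scoring mechanism needs
// is expressed in terms of plain float scores.
interface ScoringHook {
    // Score assigned to a newly discovered page.
    float initialScore();

    // Partial score passed along each outlink of a page.
    float outlinkShare(float pageScore, int outlinkCount);

    // New score of a page, given its current score and the
    // partial scores received through its inlinks.
    float updatedScore(float currentScore, float[] inlinkShares);
}

// OPIC-like behaviour: split the score equally, then accumulate what arrives.
class OpicLikeScoring implements ScoringHook {
    public float initialScore() {
        return 1.0f;
    }

    public float outlinkShare(float pageScore, int outlinkCount) {
        return outlinkCount == 0 ? 0f : pageScore / outlinkCount;
    }

    public float updatedScore(float currentScore, float[] inlinkShares) {
        float sum = currentScore;
        for (float share : inlinkShares) {
            sum += share;
        }
        return sum;
    }
}

// A trivial alternative: the score is simply the number of inlinks seen.
class InlinkCountScoring implements ScoringHook {
    public float initialScore() {
        return 0f;
    }

    public float outlinkShare(float pageScore, int outlinkCount) {
        return 1.0f;
    }

    public float updatedScore(float currentScore, float[] inlinkShares) {
        return currentScore + inlinkShares.length;
    }
}
```

Both implementations get by with a single float per page. A mechanism that needs more state than that is where the per-page metadata would come in.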
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general