Doug Cutting wrote:
> The OPIC algorithm is not really designed for re-fetching. It assumes that each link is seen only once. When pages

Ummm, well, this is definitely not our case.

> are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links in the original version of a page. This is not perfect, but considerably better

But then we would miss any new links from that page. I don't think that's acceptable. Think, e.g., of news sites, where the links on a given page change on a daily or even hourly basis.

> than what happens now. Incrementally updating the score would require re-processing the parser outputs to find outlinks from the previous version of the page and then subtracting their contribution from the page's score. This is possible, but not easy.

If you remember, some time ago I proposed a different solution: to involve linkDB in score calculations, and to store these partial OPIC score values in Inlink. This would allow us to track score contributions per source/target pair. Newly discovered links would get the initial partial score value from the originating page, and we could track these values if the original page's score changes (e.g. the number of links increases, or the page's score is updated).
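
To make the idea concrete, here is a rough sketch of per-link bookkeeping. It is not Nutch code: the class PartialScoreLinkDb and its methods are made up for illustration, the even split of a page's score across its outlinks is a simplifying assumption (real OPIC also tracks cash and history), and everything is kept in memory rather than in linkDB. The point is only that when contributions are stored per source/target pair, a refetch replaces the old contributions instead of adding to them.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: partial scores tracked per source/target pair. */
class PartialScoreLinkDb {
    // target URL -> (source URL -> partial score contributed by that source)
    private final Map<String, Map<String, Float>> inlinks = new HashMap<>();
    // source URL -> targets recorded at the last fetch of that source
    private final Map<String, Set<String>> outlinks = new HashMap<>();

    /** Record the outlinks of a fetched or refetched page. */
    void updatePage(String source, float sourceScore, Collection<String> targets) {
        // Withdraw whatever this source contributed at its previous fetch.
        Set<String> old = outlinks.getOrDefault(source, Collections.emptySet());
        for (String target : old) {
            Map<String, Float> contributions = inlinks.get(target);
            if (contributions != null) {
                contributions.remove(source);
                if (contributions.isEmpty()) {
                    inlinks.remove(target);
                }
            }
        }
        Set<String> current = new LinkedHashSet<>(targets);
        outlinks.put(source, current);
        if (current.isEmpty()) {
            return;
        }
        // Assumption: the page's score is split evenly across its current outlinks.
        float share = sourceScore / current.size();
        for (String target : current) {
            inlinks.computeIfAbsent(target, k -> new HashMap<>()).put(source, share);
        }
    }

    /** Sum of the partial scores currently contributed to a target. */
    float score(String target) {
        float sum = 0f;
        Map<String, Float> contributions = inlinks.get(target);
        if (contributions != null) {
            for (float value : contributions.values()) {
                sum += value;
            }
        }
        return sum;
    }
}
```

With this layout, calling updatePage twice for the same unchanged page leaves every target's score as it was, and a link that disappears from a page loses only that page's share.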

BTW: I've been toying with some patches to implement pluggable scoring mechanisms; it would be easy to provide hooks for custom scoring implementations. Scores are just float values, so they would be sufficient for a wide range of scoring mechanisms; for others, the newly added CrawlDatum.metadata could be used.
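
Just to illustrate what such a hook might look like (this is not taken from the patches; ScoringHook, EvenSplitHook and ConstantHook are hypothetical names, and the two methods are my assumption about the minimum a scoring implementation would need to supply):

```java
/** Hypothetical scoring hook; not an existing Nutch interface. */
interface ScoringHook {
    /** Score assigned to a newly discovered page. */
    float initialScore(String url);

    /** Score passed along each outlink of a page with the given score. */
    float outlinkScore(float parentScore, int outlinkCount);
}

/** OPIC-like behaviour: the parent's score is split evenly across its outlinks. */
class EvenSplitHook implements ScoringHook {
    public float initialScore(String url) {
        return 1.0f;
    }

    public float outlinkScore(float parentScore, int outlinkCount) {
        return outlinkCount == 0 ? 0f : parentScore / outlinkCount;
    }
}

/** Ignores link structure: every page keeps its initial score. */
class ConstantHook implements ScoringHook {
    public float initialScore(String url) {
        return 1.0f;
    }

    public float outlinkScore(float parentScore, int outlinkCount) {
        return 0f;
    }
}
```

Since both methods deal only in floats, swapping one implementation for another would not require any change to how scores are stored.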

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
