Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> are refetched, their links are processed again. I think the easiest
>>> way to fix this would be to change ParseOutputFormat to not generate
>>> STATUS_LINKED crawldata when a page has been refetched. That way
>>> scores would only be adjusted for links in the original version of a
>>> page. This is not perfect, but considerably better.

>> But then we would miss any new links from that page. I think it's
>> not acceptable. Think e.g. of news sites, where links from the same
>> page are changing on a daily or even hourly basis.

> Good point. Then maybe we should add a new status just for this,
> STATUS_REFRESH_LINK. If this is the only datum for a page, then the
> page could be added with its inherited score; otherwise, if it is an
> already known page, the score increment is ignored. That way the
> scores of existing pages would not change due to recrawling, but new
> pages would still be added with a score influenced by the page that
> linked to them. Still not perfect, but better.
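
A rough sketch of the merge rule described above, as it might look in
the crawldb update's reduce step. This is not actual Nutch code:
STATUS_REFRESH_LINK is only the proposed status, and the class, field,
and method names are invented for illustration.

  public class RefreshLinkSketch {

    static final byte STATUS_DB_UNFETCHED = 1;   // existing CrawlDatum status
    static final byte STATUS_REFRESH_LINK = 42;  // hypothetical new status

    static class Datum {
      byte status;
      float score;
      Datum(byte status, float score) { this.status = status; this.score = score; }
    }

    /**
     * Merge the data collected for one URL during a crawldb update.
     * existing == null means the URL is not yet in the crawldb.
     */
    static Datum merge(Datum existing, Datum incoming) {
      if (incoming.status == STATUS_REFRESH_LINK) {
        if (existing == null) {
          // New page: keep the score inherited from the linking page.
          return new Datum(STATUS_DB_UNFETCHED, incoming.score);
        }
        // Known page: ignore the score increment from the refetched link.
        return existing;
      }
      // ... all other statuses handled as before ...
      return existing == null ? incoming : existing;
    }
  }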


Yes, I think it could solve the problem for now, until we find a better
solution... What I don't like about it is that IMHO it makes more and
more code dependent on the particular scoring implementation (the
current OPIC algo).

>> If you remember, some time ago I proposed a different solution: to
>> involve the linkdb in score calculations, and to store these partial
>> OPIC score values in Inlink. This would allow us to track score
>> contributions per source/target pair. Newly discovered links would
>> get their initial partial score value from the originating page, and
>> we could adjust these values whenever the originating page changes
>> (e.g. its number of outlinks increases, or its score is updated).
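
A minimal sketch of the bookkeeping this would require, assuming an
Inlink extended with a stored partial score (the names here are
illustrative, not the actual Nutch API). In OPIC terms a page's
contribution to each outlink is roughly its score divided by its
outlink count, so the stored value lets us compute the delta to apply
when the source changes:

  public class PartialScoreSketch {

    static class Inlink {
      String fromUrl;
      float partialScore;  // contribution last credited to the target
    }

    /**
     * Recompute the contribution of a source page to one of its targets
     * after the source was refetched; returns the delta to apply to the
     * target's score in the crawldb.
     */
    static float updateContribution(Inlink in, float newSourceScore,
                                    int newOutlinkCount) {
      float contribution = newSourceScore / newOutlinkCount;
      float delta = contribution - in.partialScore;
      in.partialScore = contribution;  // remember for the next update
      return delta;
    }
  }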

> Involving the linkdb in score calculations means that the linkdb is
> involved in crawldb updates, which makes crawldb updates much slower,
> since the linkdb generally has many times more entries than the
> crawldb. The linkdb is not required for batch crawling and OPIC
> scoring, a common case. So if we wish to implement things this way we
> should make it optional. For example, an initial crawl could be done
> using the current algorithm, while subsequent crawls could use a
> slower, incrementally updating algorithm.

Crawls (i.e. generate+fetch) would not use the linkdb in my scheme;
only crawldb updates would.
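
Making the linkdb join optional, as suggested above, could be as simple
as a config switch. A sketch, assuming a hypothetical property name and
helper jobs (JobConf.getBoolean() is the standard Hadoop accessor;
everything else here is invented):

  import org.apache.hadoop.mapred.JobConf;

  public class OptionalLinkdbUpdate {

    public void update(JobConf job) {
      if (job.getBoolean("db.score.use.linkdb", false)) {
        runIncrementalUpdate(job);  // slower path: joins crawldb with linkdb
      } else {
        runBatchUpdate(job);        // current fast path: crawldb only
      }
    }

    private void runBatchUpdate(JobConf job) { /* existing update job */ }
    private void runIncrementalUpdate(JobConf job) { /* linkdb-aware job */ }
  }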


>> BTW: I've been toying with some patches to implement pluggable
>> scoring mechanisms; it would be easy to provide hooks for custom
>> scoring implementations. Scores are just float values, which should
>> be sufficient for a wide range of scoring mechanisms; for others,
>> the newly added CrawlDatum.metadata could be used.
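
One possible shape for such a hook (the interface and method names are
invented here, not an existing Nutch API; an implementation needing
more state than a float could read and write CrawlDatum.metadata):

  public interface ScoringHook {

    /** Initial score for a newly injected page. */
    float injectedScore(String url, float initialScore);

    /** Sort value used by the Generator when selecting the fetchlist. */
    float generatorSortValue(String url, float currentScore);

    /** Score to pass on to a newly discovered outlink. */
    float scoreForOutlink(String fromUrl, String toUrl,
                          float fromScore, int outlinkCount);
  }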

> +1

Ok, I'll work on a patch then...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



