Doug Cutting wrote:
Andrzej Bialecki wrote:
Doug Cutting wrote:
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseOutputFormat to not generate
STATUS_LINKED crawldata when a page has been refetched. That way
scores would only be adjusted for links in the original version of a
page. This is not perfect, but considerably better.
But then we would miss any new links from that page. I think it's not
acceptable. Think e.g. of news sites, where links from the same page
are changing on a daily or even hourly basis.
Good point. Then maybe we should add a new status just for this,
STATUS_REFRESH_LINK. If this is the only datum for a page, then the
page could be added with its inherited score, but otherwise, if it is
an already known page, the score increment is ignored. That way the
scores for existing pages would not change due to recrawling, but new
pages would still be added with a score influenced by the page that
linked to them. Still not perfect, but better.
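The proposed merge rule could be sketched roughly like this (a standalone sketch, not actual Nutch code — STATUS_REFRESH_LINK does not exist yet, and the class and method names here are made up for illustration):

```java
import java.util.List;

// Sketch of the proposed crawldb-update rule for a hypothetical
// STATUS_REFRESH_LINK datum: a page seen only via refresh-links is
// added with the inherited score, but for an already known page the
// score increment is dropped, so recrawls don't inflate scores.
public class RefreshLinkMerge {
  static final int STATUS_REFRESH_LINK = 100; // hypothetical status code

  static class Datum {
    final int status;
    final float score;
    Datum(int status, float score) { this.status = status; this.score = score; }
  }

  /**
   * Merge the data collected for one URL during a crawldb update.
   * @param known    whether the URL already exists in the crawldb
   * @param oldScore the existing score (ignored if !known)
   * @param incoming data emitted for this URL from the current segment
   * @return the score to store, or Float.NaN if other statuses apply
   */
  static float merge(boolean known, float oldScore, List<Datum> incoming) {
    float inherited = 0f;
    boolean onlyRefreshLinks = true;
    for (Datum d : incoming) {
      if (d.status == STATUS_REFRESH_LINK) {
        inherited += d.score;            // contribution from a linking page
      } else {
        onlyRefreshLinks = false;        // real fetch data present
      }
    }
    if (known) {
      return oldScore;                   // known page: ignore the increment
    }
    if (onlyRefreshLinks) {
      return inherited;                  // new page: seed with inherited score
    }
    return Float.NaN;                    // other statuses handled elsewhere
  }
}
```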
Yes, I think it could solve the problem for now, until we find a better
solution... What I don't like about it is that IMHO it makes more and
more code dependent on the particular scoring implementation (the
current OPIC algo).
If you remember, some time ago I proposed a different solution: to
involve linkDB in score calculations, and to store these partial OPIC
score values in Inlink. This would allow us to track score
contributions per source/target pair. Newly discovered links would
get their initial partial score value from the originating page, and we
could adjust these values when the original page changes (e.g. its
number of outlinks increases, or its score is updated).
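The idea of keeping a partial score per source/target pair could look something like this (an assumed design, not current Nutch — the class and field names are illustrative; in Nutch the contribution would live on Inlink entries in the linkdb):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-inlink OPIC contributions: each (source, target) link
// stores the partial score it contributed, so a target's score is the
// sum of its inlinks' contributions and can be corrected incrementally
// when a source page's score or outlink count changes.
public class PartialScores {
  // key: "source target", value: contribution = sourceScore / outlinkCount
  private final Map<String, Float> contrib = new HashMap<>();
  private final Map<String, Float> pageScore = new HashMap<>();

  void setContribution(String src, String dst, float srcScore, int outlinks) {
    String key = src + " " + dst;
    float newC = srcScore / outlinks;
    float oldC = contrib.getOrDefault(key, 0f);
    contrib.put(key, newC);
    // adjust the target's score by the delta instead of recomputing it
    pageScore.merge(dst, newC - oldC, Float::sum);
  }

  float scoreOf(String url) {
    return pageScore.getOrDefault(url, 0f);
  }
}
```

For example, if a source page with score 1.0 grows from 2 to 4 outlinks, re-registering the link would shrink the target's inherited contribution from 0.5 to 0.25 without touching any other entries.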
Involving the linkdb in score calculations means that the linkdb is
involved in crawldb updates, which makes crawldb updates much slower,
since the linkdb generally has many times more entries than the
crawldb. The linkdb is not required for batch crawling and OPIC
scoring, a common case. So if we wish to implement things this way we
should make it optional. For example, an initial crawl could be done
using the current algorithm while subsequent crawls could use a
slower, incrementally updating algorithm.
Crawls (i.e. generate+fetch) would not use linkdb in my scheme, only
crawldb updates.
BTW: I've been toying with some patches to implement pluggable
scoring mechanisms, it would be easy to provide hooks for custom
scoring implementations. Scores are just float values, so they would
be sufficient for a wide range of scoring mechanisms, for others the
newly added CrawlDatum.metadata could be used.
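A pluggable scoring hook might boil down to an interface like the following (purely a sketch — the method names and the OPIC-style default below are hypothetical, not an existing Nutch API):

```java
// Hypothetical extension point for scoring: implementations decide how
// scores are seeded, propagated along outlinks, and merged on update.
interface ScoringHook {
  /** Score assigned to a freshly injected page. */
  float initialScore(String url);

  /** How much of a page's score each of its outlinks inherits. */
  float outlinkContribution(String url, float pageScore, int outlinkCount);

  /** Combine an existing score with increments gathered during an update. */
  float updateScore(String url, float oldScore, float increment);
}

// A minimal OPIC-style implementation of the hook: new pages start at
// 1.0, a page's score is split evenly among its outlinks, and updates
// simply accumulate the increments.
public class OpicHook implements ScoringHook {
  public float initialScore(String url) {
    return 1.0f;
  }
  public float outlinkContribution(String url, float pageScore, int outlinkCount) {
    return outlinkCount > 0 ? pageScore / outlinkCount : 0f;
  }
  public float updateScore(String url, float oldScore, float increment) {
    return oldScore + increment;
  }
}
```

Since scores are plain floats, a different implementation could compute them however it likes, and anything that needs extra state per page could stash it in CrawlDatum.metadata.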
+1
Ok, I'll work on a patch then...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-------------------------------------------------------
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general