Doug Cutting wrote:
Andrzej Bialecki wrote:
Doug Cutting wrote:
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseOutputFormat to not generate
STATUS_LINKED crawldata when a page has been refetched. That way
scores would only be adjusted for links in the original version of a
page. This is not perfect, but considerably better.
But then we would miss any new links from that page. I think it's not
acceptable. Think e.g. of news sites, where links from the same page
are changing on a daily or even hourly basis.
Good point. Then maybe we should add a new status just for this,
STATUS_REFRESH_LINK. If this is the only datum for a page, then the
page could be added with its inherited score, but otherwise, if it is
an already known page, the score increment is ignored. That way the
scores for existing pages would not change due to recrawling, but new
pages would still be added with a score influenced by the page that
linked to them. Still not perfect, but better.
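The proposed merge rule could be sketched roughly like this (a standalone sketch, not actual Nutch code — STATUS_REFRESH_LINK does not exist yet, and the class and method names here are made up for illustration):

```java
import java.util.List;

// Sketch of the proposed crawldb-update rule for a hypothetical
// STATUS_REFRESH_LINK datum: a page seen only via refresh-links is
// added with the inherited score, but for an already known page the
// score increment is dropped, so recrawls don't inflate scores.
public class RefreshLinkMerge {
  static final int STATUS_REFRESH_LINK = 100; // hypothetical status code

  static class Datum {
    final int status;
    final float score;
    Datum(int status, float score) { this.status = status; this.score = score; }
  }

  /**
   * Merge the data collected for one URL during a crawldb update.
   * @param known    whether the URL already exists in the crawldb
   * @param oldScore the existing score (ignored if !known)
   * @param incoming data emitted for this URL from the current segment
   * @return the score to store, or Float.NaN if other statuses apply
   */
  static float merge(boolean known, float oldScore, List<Datum> incoming) {
    float inherited = 0f;
    boolean onlyRefreshLinks = true;
    for (Datum d : incoming) {
      if (d.status == STATUS_REFRESH_LINK) {
        inherited += d.score;            // contribution from a linking page
      } else {
        onlyRefreshLinks = false;        // real fetch data present
      }
    }
    if (known) {
      return oldScore;                   // known page: ignore the increment
    }
    if (onlyRefreshLinks) {
      return inherited;                  // new page: seed with inherited score
    }
    return Float.NaN;                    // other statuses handled elsewhere
  }
}
```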
Yes, I think it could solve the problem for now, until we find a better
solution... What I don't like about it is that IMHO it makes more and
more code dependent on the particular scoring implementation (the
current OPIC algo).
If you remember, some time ago I proposed a different solution: to
involve linkDB in score calculations, and to store these partial OPIC
score values in Inlink. This would allow us to track score
contributions per source/target pair. Newly discovered links would
get their initial partial score value from the originating page, and we
could adjust these values when the original page changes (e.g. its
number of outlinks increases, or its score is updated).
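The idea of keeping a partial score per source/target pair could look something like this (an assumed design, not current Nutch — the class and field names are illustrative; in Nutch the contribution would live on Inlink entries in the linkdb):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-inlink OPIC contributions: each (source, target) link
// stores the partial score it contributed, so a target's score is the
// sum of its inlinks' contributions and can be corrected incrementally
// when a source page's score or outlink count changes.
public class PartialScores {
  // key: "source target", value: contribution = sourceScore / outlinkCount
  private final Map<String, Float> contrib = new HashMap<>();
  private final Map<String, Float> pageScore = new HashMap<>();

  void setContribution(String src, String dst, float srcScore, int outlinks) {
    String key = src + " " + dst;
    float newC = srcScore / outlinks;
    float oldC = contrib.getOrDefault(key, 0f);
    contrib.put(key, newC);
    // adjust the target's score by the delta instead of recomputing it
    pageScore.merge(dst, newC - oldC, Float::sum);
  }

  float scoreOf(String url) {
    return pageScore.getOrDefault(url, 0f);
  }
}
```

For example, if a source page with score 1.0 grows from 2 to 4 outlinks, re-registering the link would shrink the target's inherited contribution from 0.5 to 0.25 without touching any other entries.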
Involving the linkdb in score calculations means that the linkdb is
involved in crawldb updates, which makes crawldb updates much slower,
since the linkdb generally has many times more entries than the
crawldb. The linkdb is not required for batch crawling and OPIC
scoring, a common case. So if we wish to implement things this way we
should make it optional. For example, an initial crawl could be done
using the current algorithm while subsequent crawls could use a
slower, incrementally updating algorithm.
Crawls (i.e. generate+fetch) would not use linkdb in my scheme, only
crawldb updates.
BTW: I've been toying with some patches to implement pluggable
scoring mechanisms, it would be easy to provide hooks for custom
scoring implementations. Scores are just float values, so they would
be sufficient for a wide range of scoring mechanisms, for others the
newly added CrawlDatum.metadata could be used.
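A pluggable scoring hook might boil down to an interface like the following (purely a sketch — the method names and the OPIC-style default below are hypothetical, not an existing Nutch API):

```java
// Hypothetical extension point for scoring: implementations decide how
// scores are seeded, propagated along outlinks, and merged on update.
interface ScoringHook {
  /** Score assigned to a freshly injected page. */
  float initialScore(String url);

  /** How much of a page's score each of its outlinks inherits. */
  float outlinkContribution(String url, float pageScore, int outlinkCount);

  /** Combine an existing score with increments gathered during an update. */
  float updateScore(String url, float oldScore, float increment);
}

// A minimal OPIC-style implementation of the hook: new pages start at
// 1.0, a page's score is split evenly among its outlinks, and updates
// simply accumulate the increments.
public class OpicHook implements ScoringHook {
  public float initialScore(String url) {
    return 1.0f;
  }
  public float outlinkContribution(String url, float pageScore, int outlinkCount) {
    return outlinkCount > 0 ? pageScore / outlinkCount : 0f;
  }
  public float updateScore(String url, float oldScore, float increment) {
    return oldScore + increment;
  }
}
```

Since scores are plain floats, a different implementation could compute them however it likes, and anything that needs extra state per page could stash it in CrawlDatum.metadata.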
+1
Ok, I'll work on a patch then...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-------------------------------------------------------
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general