Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by KenKrugler: http://wiki.apache.org/nutch/FixingOpicScoring ------------------------------------------------------------------------------ I'm going to try to summarize the implementation proposed by the original paper, as I think it applies to Nutch, but there are still a few open issues that I'm trying to resolve with the authors. - 1. Each page has a "historical cash value" that represents the weight of inbound links. + 1. Each page has a "current cash value" that represents the weight of inbound links. - 1. Whenever a page is processed, the page's cash is distributed to outlinks. + 1. Whenever a page is processed, the page's current cash is distributed to outlinks. - 1. Each page also has a "historical cash value" that represents the cash that's flowed through the page. + 1. Each page also has a "historical cash value" that represents the cash that's flowed through the page. Initially this starts out as 0.0. 1. The score of a page is represented by the sum of the historical and current cash values. - 1. There is one special, virtual page that has bidirectional links with every other page in the entire web graph. + 1. There is one special, virtual root page that has bidirectional links with every other page in the entire web graph. - a. When a crawl is initially started, this page has a cash value of 1.0, and this is then distributed (as 1/n) to the n injected pages. + a. When a crawl is initially started, the root page has a cash value of 1.0, and this is then distributed (as 1/n) to the n injected pages. - a. Whenever a page is being processed, this virtual page can receive some of the page's current cash, due to the implicit link from every page to this virtual page. + a. Whenever a page is being processed, the root page can receive some of the page's current cash, due to the implicit link from every page to the root page. 1. To handle recrawling, every page also has the last time it was processed. In addition, there's a fixed "time window" that's used to calculate the historical cash value of a page. For the Xyleme crawler, this was set at 3 months, but it seems to be heavily dependent on the rate of re-crawling (average time between page refetches). 1. When a page is being processed, its historical cash value is calculated in one of two ways, based on the page's delta time (time between when it was last processeed, and now). a. If the delta time is >= the time window, then the historical cash value is set to the page's current cash value * (time window/delta time). So using the above, if the page's cash value is 10, and the delta time is 6 months, then the historical cash value gets set to 10 * (6/3) = 5.0. @@ -42, +42 @@ While distributing the cash of a page to the outlinks, there are a few details that need to be handled: - * Some amount of the cash goes to the virtual page, while the rest of the cash goes to the real outlinks. If a page is a leaf (no outlinks) then all of the cash goes to the virtual page. The ratio of real/virtual can be adjusted to put greater emphasis on new pages versus recrawling, but the paper is a bit fuzzy about how to do this properly. + * Some amount of the cash goes to the root page, while the rest of the cash goes to the real outlinks. If a page is a leaf (no outlinks) then all of the cash goes to the root page. The ratio of real/root can be adjusted to put greater emphasis on new pages versus recrawling, but the OPIC paper is a bit fuzzy about how to do this properly. * Self-referencial links should (I think) be ignored. But that's another detail to confirm. * There's a mention in the paper to adjusting the amount of cash given to internal (same domain) links versus external links, but no real details. This would be similar to the current Nutch support for providing a different initial score for internal vs. external pages, and the "ignore internal links" flag. - * I'm not sure how best to efficiently implement the virtual page such that it efficiently gets cash from every single page that's processed. If you treat it as a special URL, then would that slow down the update to the crawldb? + * I'm not sure how best to efficiently implement the root page such that it efficiently gets cash from every single page that's processed. If you treat it as a special URL, then would that slow down the update to the crawldb? ------------------------------------------------------------------------- Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 _______________________________________________ Nutch-cvs mailing list Nutch-cvs@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-cvs