Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/FixingOpicScoring

------------------------------------------------------------------------------
    a. When a crawl is initially started, the root page has a cash value of 
1.0, and this is then distributed (as 1/n) to the n injected pages (see the 
sketch just after this list).
    a. Whenever a page is being processed, the root page can receive some of 
the page's current cash, due to the implicit link from every page to the root 
page.
   1. To handle recrawling, every page also stores the last time it was 
processed. In addition, there's a fixed "time window" that's used to calculate 
the historical cash value of a page. For the Xyleme crawler this was set at 3 
months, but the right value seems to depend heavily on the rate of re-crawling 
(the average time between page refetches).
+  1. When a page is being processed, its historical cash value is calculated 
from the page's current cash value and the previous historical cash value. The 
historical cash value is estimated via interpolation to come up with an 
"expected" historical cash value that is close to what you'd get if every page 
were re-fetched and processed at the same regular interval. Details are below.
-  1. When a page is being processed, its historical cash value is calculated 
in one of two ways, based on the page's delta time (time between when it was 
last processed, and now).
-   a. If the delta time is >= the time window, then the historical cash value 
is set to the page's current cash value * (time window/delta time). So using 
the above, if the page's cash value is 10, and the delta time is 6 months, then 
the historical cash value gets set to 10 * (3/6) = 5.0.
-   a. If the delta time is < the time window, then the historical cash value 
is set to the page's current cash value + (historical cash value * 
(time window - delta time)/time window). This is kind of odd, but basically it 
assumes that the "weight" of the past (the historical cash value saved for the 
page) decreases over time, while the current cash increases as more pages are 
processed (and their inbound links add to this page's current cash).
  
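+ As a tiny worked example of the initial injection step (item a. at the top 
of the list above), here's a sketch with made-up names, not actual Nutch code: 
with a root cash of 1.0 and four injected seeds, each seed starts out with 
1.0 / 4 = 0.25.
+ 
+ {{{
+ // Sketch only: hypothetical names, not actual Nutch code.
+ public class InjectCashSketch {
+   public static void main(String[] args) {
+     String[] seeds = {
+       "http://example.com/a", "http://example.com/b",
+       "http://example.org/c", "http://example.net/d"
+     };
+     float rootCash = 1.0f;
+     // The root page's cash is split evenly (1/n) across the n injected pages.
+     float perSeed = rootCash / seeds.length;
+     for (String seed : seeds) {
+       System.out.println(seed + " -> initial cash " + perSeed); // 0.25 each
+     }
+   }
+ }
+ }}}
+ 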
  === Details of cash distribution ===
  
@@ -46, +44 @@

   * Self-referential links should (I think) be ignored. But that's another 
detail to confirm.
   * There's a mention in the paper of adjusting the amount of cash given to 
internal (same domain) links versus external links, but no real details. This 
would be similar to the current Nutch support for providing a different initial 
score for internal vs. external pages, and the "ignore internal links" flag. 
See the sketch just after this list.
   * I'm not sure how best to implement the root page such that it efficiently 
gets cash from every single page that's processed. If you treat it as a 
special URL, then would that slow down the update to the crawldb?
+  * The OPIC paper talks about giving some of the root page's cash to pages 
to adjust the crawl priorities. Unfortunately, not much detail is provided. 
The three approaches mentioned are:
+   a. Give cash to unfetched pages, to encourage broadening the crawl.
+   a. Give cash to fetched pages, to encourage recrawling.
+   a. Give cash to specific pages in a target area (e.g. by domain), for 
focused crawling.
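+ 
+ To make the bullets above concrete, here's a minimal sketch of one way the 
cash hand-out could work. Everything below is made up for illustration (the 
class, the ROOT_SHARE constant, and the internal/external weights, which are 
loosely analogous to Nutch's existing internal/external score settings). It 
is not actual Nutch or Xyleme code.
+ 
+ {{{
+ // Sketch only: hypothetical names and weights, not actual Nutch code.
+ import java.net.URI;
+ import java.util.HashMap;
+ import java.util.List;
+ import java.util.Map;
+ 
+ public class CashDistributionSketch {
+ 
+   // Fraction of a page's cash given to the implicit root link (assumed value).
+   static final float ROOT_SHARE = 0.1f;
+ 
+   // Splits a page's current cash among its outlinks plus the implicit link
+   // to the root page, and returns the cash contribution per target URL.
+   static Map<String, Float> distribute(String fromUrl, float cash,
+       List<String> outlinks, float internalWeight, float externalWeight) {
+     Map<String, Float> contributions = new HashMap<String, Float>();
+     contributions.put("virtual:root", cash * ROOT_SHARE);
+     float remaining = cash - (cash * ROOT_SHARE);
+ 
+     // First pass: total up the weights, so that the scaled shares below
+     // still sum to exactly `remaining`.
+     float totalWeight = 0.0f;
+     for (String link : outlinks) {
+       if (link.equals(fromUrl)) continue; // ignore self-referential links
+       totalWeight += sameHost(fromUrl, link) ? internalWeight : externalWeight;
+     }
+     if (totalWeight == 0.0f) {
+       // No usable outlinks: give all of the cash to the root page.
+       contributions.put("virtual:root", cash);
+       return contributions;
+     }
+ 
+     // Second pass: hand out the remaining cash in proportion to each link's
+     // internal/external weight.
+     for (String link : outlinks) {
+       if (link.equals(fromUrl)) continue;
+       float weight = sameHost(fromUrl, link) ? internalWeight : externalWeight;
+       float prev = contributions.containsKey(link) ? contributions.get(link) : 0.0f;
+       contributions.put(link, prev + remaining * weight / totalWeight);
+     }
+     return contributions;
+   }
+ 
+   static boolean sameHost(String a, String b) {
+     return URI.create(a).getHost().equals(URI.create(b).getHost());
+   }
+ }
+ }}}
+ 
+ Whether the root page's share should be a fixed fraction like this, or just 
one more "link" in the denominator, is exactly the kind of detail the paper 
leaves open.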
  
+ === Details of historical cash calculation ===
+ 
+ When a page is being processed, its historical cash value is calculated in 
one of two ways, based on the page's delta time (time between when it was last 
processed, and now). A small sketch of the calculation follows the list.
+  1. If the delta time is >= the time window, then the historical cash value 
is set to the page's current cash value * (time window/delta time). So using 
the above, if the page's cash value is 10, and the delta time is 6 months, then 
the historical cash value gets set to 10 * (3/6) = 5.0.
+  1. If the delta time is < the time window, then the historical cash value is 
set to the page's current cash value + (historical cash value * 
(time window - delta time)/time window). This is kind of odd, but basically it 
assumes that the "weight" of the past (the historical cash value saved for the 
page) decreases over time, while the current cash increases as more pages are 
processed (and their inbound links add to this page's current cash).
+ 
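+ Transcribing the two cases directly into a small Java method (hypothetical 
names, sketch only):
+ 
+ {{{
+ // Sketch only: a direct transcription of the two cases above.
+ public class HistoricalCashSketch {
+ 
+   static float historicalCash(float currentCash, float previousHistoricalCash,
+       float deltaTime, float timeWindow) {
+     if (deltaTime >= timeWindow) {
+       // Case 1: scale the current cash down to one time window's worth.
+       return currentCash * (timeWindow / deltaTime);
+     }
+     // Case 2: keep a share of the old historical value; its weight shrinks
+     // as the delta time approaches the full time window.
+     return currentCash
+         + previousHistoricalCash * (timeWindow - deltaTime) / timeWindow;
+   }
+ 
+   public static void main(String[] args) {
+     // The case 1 example above: cash 10, 3-month window, 6-month delta -> 5.0
+     System.out.println(historicalCash(10.0f, 0.0f, 6.0f, 3.0f));
+   }
+ }
+ }}}
+ 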
+ There's an issue with new pages, as these will have a current cash value, but 
no historical cash value. The OPIC paper says that they (Xyleme) use an average 
value for recently introduced pages, but there aren't any more details. I'm 
trying to get some clarification.
+ 
