The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/FixingOpicScoring

------------------------------------------------------------------------------
   1. Each page also has a "historical cash value" that represents the cash 
that's flowed through the page. Initially this starts out as 0.0.
   1. The score of a page is represented by the sum of the historical and 
current cash values.
   1. There is one special, virtual root page that has bidirectional links with 
every other page in the entire web graph. 
-   a. When a crawl is initially started, the root page has a cash value of 
1.0, and this is then distributed (as 1/n) to the n injected pages. ''(What 
happens when more pages are injected?)''
+   a. When a crawl is initially started, the root page has a cash value of 
1.0, and this is then distributed (as 1/n) to the n injected pages. When more 
pages are injected, it's not clear what happens, but I imagine that some of the 
cash that has accumulated in the root page is distributed to the injected 
pages, thus keeping the total "energy" of the system constant.
    a. Whenever a page is being processed, the root page can receive some of 
the page's current cash, due to the implicit link from every page to the root 
page.
   1. To handle re-crawling, every page also records the last time it was 
processed. In addition, there's a fixed "time window" that's used to calculate 
the historical cash value of a page. For the Xyleme crawler, this was set at 3 
months, but the right value seems to depend heavily on the rate of re-crawling 
(average time between page refetches). ''We could use a value derived from 
fetchInterval.''
   1. When a page is being processed, its historical cash value is calculated 
from the page's current cash value and the previous historical cash value. The 
historical cash value is estimated via interpolation to come up with an 
"expected" historical cash value that is close to what you'd get if every page 
were re-fetched and processed at the same, regular interval. Details are below.
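
The cash bookkeeping in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Nutch or Xyleme implementation: the names `Page`, `ROOT`, `inject` and `process` are hypothetical, a processed page is assumed to split its cash equally between its outlinks and the root page, a later injection is assumed to hand out all of the root's accumulated cash, and the time-window interpolation of the historical value is left out entirely.

```python
from dataclasses import dataclass, field

ROOT = "__root__"  # hypothetical key for the virtual root page


@dataclass
class Page:
    cash: float = 0.0     # current cash
    history: float = 0.0  # historical cash, starts out as 0.0
    outlinks: list = field(default_factory=list)

    @property
    def score(self) -> float:
        # Score is the sum of the historical and current cash values.
        return self.history + self.cash


def inject(pages: dict, urls: list) -> None:
    """Give each of the n injected pages 1/n of the root's cash.

    On the first call the root is created with a cash value of 1.0.
    Assumption: later injections distribute all cash the root has
    accumulated, which keeps the total cash in the system constant.
    """
    root = pages.setdefault(ROOT, Page(cash=1.0))
    share = root.cash / len(urls)
    for url in urls:
        pages.setdefault(url, Page()).cash += share
    root.cash = 0.0


def process(pages: dict, url: str) -> None:
    """Move a page's current cash to its outlinks and to the root.

    Assumption: the implicit link to the root gets the same share as
    any ordinary outlink. The historical value is simply accumulated
    here, with no time-window interpolation.
    """
    page = pages[url]
    targets = list(page.outlinks) + [ROOT]
    share = page.cash / len(targets)
    for target in targets:
        pages.setdefault(target, Page()).cash += share
    page.history += page.cash
    page.cash = 0.0
```

For example, after `inject(pages, ["a", "b"])` on an empty `pages` dict, processing page "a" with a single outlink to "b" moves half of a's cash to "b" and half to the root, and the sum of all current cash values is unchanged.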

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
