Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/FixingOpicScoring

------------------------------------------------------------------------------
   1. You can't recrawl, at least not without having the recrawled pages get 
inflated scores. This isn't such a problem when the score is just being used to 
sort the fetch list, but when the score is also used to determine the page's 
boost (for the corresponding Lucene document) then this is a Bad Thing. And 
that's currently what Nutch does.
   1. The total score of the "system" as defined by the graph of pages 
continues to increase, as each newly crawled page adds to the summed score of 
the system. This then penalizes pages that aren't recrawled, as their score (as 
a percentage of the total system) keeps dropping.
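
The second problem can be seen with a tiny numeric sketch (hypothetical page names and scores, not Nutch code): each newly crawled page adds to the system total, so a page that is never recrawled keeps a constant absolute score but a shrinking share of the whole.

```python
# Hypothetical illustration (not Nutch code): as newly crawled pages add
# cash to the system, a page whose score stays fixed loses relative weight.
scores = {"pageA": 1.0}                  # pageA is never recrawled
total = sum(scores.values())
share_before = scores["pageA"] / total   # pageA holds 100% of the system

# Crawl three new pages, each contributing its own cash to the total.
for name in ("pageB", "pageC", "pageD"):
    scores[name] = 1.0

total = sum(scores.values())
share_after = scores["pageA"] / total    # pageA's share has dropped to 25%

print(share_before, share_after)
```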
  
  There were other problems that have recently (as of August 2006) been fixed, 
such as self-referential links creating positive feedback loops.
  
  == The Adaptive OPIC Algorithm ==
  
  I'm going to try to summarize the implementation proposed by the original 
paper, as I think it applies to Nutch, but there are still a few open issues 
that I'm trying to resolve with the authors.
  
   1. Each page has a "current cash value" that represents the weight of 
inbound links.
   1. Whenever a page is processed, the page's current cash is distributed to 
outlinks, and zeroed.
   1. Each page also has a "historical cash value" that represents the cash 
that's flowed into the page in the last iteration. Initially this starts out 
as 0.0.
   1. The score of a page is represented by the sum of the historical and 
current cash values.
   1. There is one special, virtual root page that has bidirectional links with 
every other page in the entire web graph. 
    a. When a crawl is initially started, the root page has a cash value of 
1.0, and this is then distributed (as 1/n) to the n injected pages. When more 
pages are injected, it's not clear what happens, but I imagine that some of the 
cash that has accumulated in the root page is distributed to the injected 
pages, thus keeping the total "energy" of the system constant.
    a. Whenever a page is being processed, the root page can receive some of 
the page's current cash, due to the implicit link from every page to the root 
page. So-called "dangling nodes", i.e. pages without outlinks, give all of 
their cash to the root page.
   1. To handle recrawling, every page also has the last time it was processed. 
In addition, there's a fixed "time window" that's used to calculate the 
historical cash value of a page. For the Xyleme crawler, this was set at 3 
months, but it seems to be heavily dependent on the rate of re-crawling 
(average time between page refetches). ''We could use a value derived from 
fetchInterval.''
   1. When a page is being processed, its historical cash value is calculated 
from the page's current cash value and the previous historical cash value. The 
historical cash value is estimated via interpolation to come up with an 
"expected" historical cash value that is close to what you'd get if every page 
were re-fetched and processed at the same, regular interval. Details are below.
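
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Nutch or Xyleme implementation: the `Page` structure, the `root_share` fraction given to the implicit root link, and the simplified history update (no time-window interpolation) are all placeholders of my own.

```python
from dataclasses import dataclass, field

ROOT = "_root_"  # the virtual root page, linked to and from every real page

@dataclass
class Page:
    cash: float = 0.0        # "current cash": weight of recent inbound links
    history: float = 0.0     # cash that flowed in during the last window
    outlinks: list = field(default_factory=list)

def inject(pages, urls):
    """Seed the crawl: the root's cash is split evenly among injected pages."""
    share = pages[ROOT].cash / len(urls)
    for url in urls:
        pages.setdefault(url, Page()).cash += share
    pages[ROOT].cash = 0.0

def process(pages, url, root_share=0.1):
    """Distribute a page's cash to its outlinks (and the root), then zero it."""
    page = pages[url]
    if not page.outlinks:
        # "Dangling node": all cash goes to the virtual root page.
        pages[ROOT].cash += page.cash
    else:
        to_root = page.cash * root_share          # implicit link to the root
        pages[ROOT].cash += to_root
        share = (page.cash - to_root) / len(page.outlinks)
        for target in page.outlinks:
            pages.setdefault(target, Page()).cash += share
    # Simplification: the real algorithm interpolates the historical value
    # over a fixed time window; here we just record the last distribution.
    page.history = page.cash
    page.cash = 0.0

def score(page):
    """A page's score is the sum of its historical and current cash."""
    return page.cash + page.history

# Usage: inject two pages, link one to the other, and process both.
pages = {ROOT: Page(cash=1.0)}
inject(pages, ["http://a/", "http://b/"])
pages["http://a/"].outlinks = ["http://b/"]
process(pages, "http://a/")
process(pages, "http://b/")   # dangling node: its cash returns to the root
```

Note that the total current cash in the system stays constant (here 1.0): whatever the pages don't pass along to outlinks flows back to the root, which is what keeps the "energy" of the system from inflating.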
  
