Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/FixingOpicScoring

New page:
== Fixing the OPIC algorithm in Nutch ==
'''Ken Krugler'''

'''''21 August 2006'''''

'''DRAFT'''

== Introduction ==

'''WARNING''' - what I've dumped in below still needs to be completed and 
cleaned up.

The goal of this proposal is to define the changes needed for Nutch to do a 
real implementation of the Adaptive On-line Page Importance Calculation 
(Adaptive OPIC), as described by this 
[http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html paper].

Currently Nutch uses a partial implementation that's mostly useful for crawling 
"important" pages first. A partial description of this can be found 
[http://www2005.org/cdrom/docs/p864.pdf here].

Unfortunately this has a number of problems, specifically:

 1. You can't recrawl, at least not without the recrawled pages getting 
inflated scores. This isn't such a problem when the score is just being used to 
sort the fetch list, but when the score is also used to determine the page's 
boost (for the corresponding Lucene document), then this is a Bad Thing. And 
that's currently what Nutch does.
 1. The total score of the "system" as defined by the graph of pages continues 
to increase, as each newly crawled page adds to the summed score of the system. 
This then penalizes pages that aren't recrawled, as their score (as a 
percentage of the total system) keeps dropping.

There were other problems that have recently (as of August 2006) been fixed, 
such as self-referential links creating positive feedback loops.

== The Adaptive OPIC Algorithm ==

I'm going to try to summarize the implementation proposed by the original 
paper, as I think it applies to Nutch, but there are still a few open issues 
that I'm trying to resolve with the authors.

 1. Each page has a "current cash value" that represents the weight of its 
inbound links.
 1. Whenever a page is processed, the page's cash is distributed to outlinks.
 1. Each page also has a "historical cash value" that represents the cash 
that's flowed through the page.
 1. The score of a page is represented by the sum of the historical and current 
cash values.
 1. There is one special, virtual page that has bidirectional links with every 
other page in the entire web graph. 
  a. When a crawl is initially started, this page has a cash value of 1.0, and 
this is then distributed (as 1/n) to the n injected pages.
  a. Whenever a page is being processed, this virtual page can receive some of 
the page's current cash, due to the implicit link from every page to this 
virtual page.
 1. To handle recrawling, every page also records the last time it was 
processed. In addition, there's a fixed "time window" that's used to calculate 
the historical cash value of a page. For the Xyleme crawler, this was set at 3 
months, but the right value seems to be heavily dependent on the rate of 
re-crawling (average time between page refetches).
 1. When a page is being processed, its historical cash value is calculated in 
one of two ways, based on the page's delta time (the time between when it was 
last processed, and now); see the sketch after this list.
  a. If the delta time is >= the time window, then the historical cash value is 
set to the page's current cash value * (time window/delta time). So using the 
above, if the page's cash value is 10, and the delta time is 6 months, then the 
historical cash value gets set to 10 * (3/6) = 5.0.
  a. If the delta time is < the time window, then the historical cash value is 
set to the page's current cash value + (historical cash value * (time window - 
delta time)/time window). This is a bit odd, but basically it assumes that the 
"weight" of the past (the historical cash value saved for the page) decreases 
over time, while the current cash will increase as more pages are processed 
(and thus their contributions flow into this page's current cash).
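
To make the two cases in the last item concrete, here's a minimal sketch in 
plain Java. It is not actual Nutch code; the class, the TIME_WINDOW constant, 
and the method name are just illustrations of how the calculation could look.

{{{
// Illustrative sketch only: TIME_WINDOW and the method below are assumptions,
// not existing Nutch code.
public class OpicHistorySketch {

  // Fixed time window (e.g. 3 months, expressed in milliseconds).
  static final long TIME_WINDOW = 90L * 24 * 60 * 60 * 1000;

  /**
   * Recompute a page's historical cash value when the page is processed.
   *
   * @param currentCash    cash accumulated since the page was last processed
   * @param historicalCash historical cash value stored for the page
   * @param deltaTime      milliseconds since the page was last processed
   */
  static float updateHistoricalCash(float currentCash, float historicalCash,
      long deltaTime) {
    if (deltaTime >= TIME_WINDOW) {
      // First case: scale the current cash down to the fixed window.
      // E.g. cash = 10, window = 3 months, delta = 6 months -> 10 * (3/6) = 5.0
      return currentCash * ((float) TIME_WINDOW / deltaTime);
    } else {
      // Second case: new cash plus a share of the old history that shrinks
      // as the delta time approaches the window.
      return currentCash
          + historicalCash * ((float) (TIME_WINDOW - deltaTime) / TIME_WINDOW);
    }
  }
}
}}}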

=== Details of cash distribution ===

While distributing the cash of a page to its outlinks, there are a few details 
that need to be handled (see the sketch after this list):

 * Some amount of the cash goes to the virtual page, while the rest of the cash 
goes to the real outlinks. If a page is a leaf (no outlinks) then all of the 
cash goes to the virtual page. The ratio of real/virtual can be adjusted to put 
greater emphasis on new pages versus recrawling, but the paper is a bit fuzzy 
about how to do this properly.
 * Self-referential links should (I think) be ignored. But that's another 
detail to confirm.
 * The paper mentions adjusting the amount of cash given to internal (same 
domain) links versus external links, but gives no real details. This would be 
similar to the current Nutch support for providing a different initial score 
for internal vs. external pages, and the "ignore internal links" flag.
 * I'm not sure how best to implement the virtual page so that it efficiently 
gets cash from every single page that's processed. If you treat it as a special 
URL, would that slow down the update to the crawldb?
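
As a rough illustration of the bullets above, here's a sketch of how the cash 
distribution could look, assuming an even split across outlinks and a fixed 
real/virtual ratio. The VIRTUAL_PAGE key, the virtualPageRatio parameter, and 
the method itself are assumptions for the sketch, not existing Nutch code.

{{{
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not part of Nutch.
public class OpicDistributionSketch {

  static final String VIRTUAL_PAGE = "opic:virtual";

  /**
   * Split a page's current cash between the virtual page and its real
   * outlinks. Self-referential links are assumed to be filtered out already.
   *
   * @param cash             the page's current cash, to be given away
   * @param outlinks         the page's real outlink URLs
   * @param virtualPageRatio fraction of the cash routed to the virtual page
   * @return map from target URL to the cash it receives
   */
  static Map<String, Float> distributeCash(float cash, List<String> outlinks,
      float virtualPageRatio) {
    Map<String, Float> contributions = new HashMap<String, Float>();

    if (outlinks.isEmpty()) {
      // Leaf page: all of the cash goes to the virtual page.
      contributions.put(VIRTUAL_PAGE, cash);
      return contributions;
    }

    // Some of the cash goes to the virtual page via the implicit link...
    float virtualShare = cash * virtualPageRatio;
    contributions.put(VIRTUAL_PAGE, virtualShare);

    // ...and the remainder is split evenly among the real outlinks.
    // Duplicate outlink URLs have their contributions summed.
    float perOutlink = (cash - virtualShare) / outlinks.size();
    for (String outlink : outlinks) {
      contributions.merge(outlink, perOutlink, Float::sum);
    }
    return contributions;
  }
}
}}}

Whether an even split is right, and how the ratio should change to favor new 
pages versus recrawling, are exactly the parts the paper is fuzzy about.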
