If someone wants to adopt the plugin, it seems they have to start a new crawl. Will there be a way to apply these scoring mechanisms to existing pages that have already been fetched, indexed and merged?
Can you please shed some light?

I think it would be possible to write a map-reduce job that simulated the crawl of all current pages, to the extent necessary to get reasonable history/page cash values for OPIC. But that's just a guess until the actual implementation is at least sketched out.
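
As a rough illustration of what such a simulated crawl might compute, here is a minimal single-machine sketch in Python of the cash/history bookkeeping from the OPIC paper. It is not Nutch code and not a map-reduce job. The function name, the `outlinks` input format, the round-robin visit order and the handling of pages without outlinks are all assumptions made for the example.

```python
def simulate_opic(outlinks, rounds=50):
    """Replay a crawl over an already-fetched link graph, OPIC-style.

    outlinks: dict mapping page -> list of pages it links to
              (hypothetical input, e.g. dumped from an existing link db).
    Returns a dict page -> estimated importance; the values sum to 1.
    """
    pages = list(outlinks)
    # Every page starts with an equal share of the total cash (1.0).
    cash = {p: 1.0 / len(pages) for p in pages}
    history = {p: 0.0 for p in pages}

    for _ in range(rounds):
        for page in pages:  # assumption: simple round-robin visit order
            amount = cash[page]
            history[page] += amount  # record the cash seen at this visit
            cash[page] = 0.0
            # Assumption: links to unknown pages are ignored, and a page
            # with no usable outlinks spreads its cash over all pages.
            targets = [t for t in outlinks[page] if t in cash] or pages
            share = amount / len(targets)
            for t in targets:
                cash[t] += share

    # Importance estimate: a page's share of all history plus current cash.
    total = sum(history.values()) + sum(cash.values())
    return {p: (history[p] + cash[p]) / total for p in pages}


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
    for page, score in sorted(simulate_opic(graph, rounds=200).items()):
        print(page, round(score, 3))
```

The total cash stays constant because each visit only moves it around, so running more rounds refines the estimates without inflating them.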

-- Ken

Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Ken Krugler wrote:
 Eugen Kochuev wrote:
 Hello Andrzej,

 Please see the scoring API - you can write a plugin that manipulates
 page scores according to your own idea.

 Thanks a lot for your answer, but could you please shed some more
 light on the scoring technique used in Nutch?
 As far as I can see from the source code, Nutch uses something similar
 to the PageRank algorithm, propagating page scores through outlinks,
 but only one iteration is used (while PageRank requires several
 iterations to converge).

 That's a somewhat complicated subject - I could either explain this in
 very general terms, or suggest that you read the paper that underlies
 the current Nutch implementation (with a twist). Please see the
 comment in OPICScoringFilter.java for the link to the paper.

 I've started writing up a description of the changes that I think need
 to be made to Nutch to really implement the OPIC algorithm, as
 described by the "Adaptive On-Line Page Importance Computation"
 paper (ACM 1-58113-680-3/03/0005).

 Should I just open a JIRA issue, and dump what might be a pretty long
 write-up into it?

Yes, please do - I'd love to implement this in that original form, even
if it would go into another plugin ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
