Bill Goffe wrote:
I'm trying to tweak the search results at my http://ese.rfe.org/ and I've
got two questions (I'm running 0.7.2):

  - In searching at the above for "unemployment" the leading results have
    10 or more occurrences of that word on the page. I'd like to reduce
    the influence of multiple occurrences of a word on a page and give more
    weight to links, titles, and such. But, in looking at
    nutch-default.xml I don't see any obvious parameters for this. I have
    upped the following to these values:
      indexer.score.power 2.5
      db.score.link.external 4.0
      query.url.boost 2.0
      query.anchor.boost 2.0
      query.title.boost 2.0
      query.phrase.boost 2.0
    As the top links for unemployment are state agencies, I think I will
    switch db.ignore.internal.links back to true as there are more external
    links to where I would like users to go: http://www.bls.gov .

  - In the old 0.7 tutorial, I could swear that the example suggested
    running "nutch analyze", but it no longer mentions that (it's not on
    the Internet Archive). I believe it also suggested running it more
    than once. Thoughts on these lines? I currently run it once after
    "nutch updatedb," but would more runs aid link analysis?

Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool performs a couple of iterations to propagate the scores along links. However, it is a slow and very resource-hungry process, so it was sometimes impossible to complete the analysis step even for moderately sized collections. 0.7 also offers a static ranking method, which doesn't require running the analysis step and which is based on the number of outlinks and inlinks.
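
To give a feel for what "a couple of iterations" of score propagation means, here is a minimal Python sketch of textbook PageRank-style iteration. It is not the code of the Nutch analyze tool; the function name, the damping value of 0.85, and the toy link graph are all assumptions made for illustration.

```python
def propagate_scores(links, iterations=2, damping=0.85):
    """Textbook PageRank-style propagation (illustrative, not Nutch code).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform score.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share each round.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among its outlinks.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page without outlinks spreads its score over all pages.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical toy graph: three agency sites linking to one central site.
links = {
    "a.example": ["bls.gov"],
    "b.example": ["bls.gov"],
    "c.example": ["bls.gov"],
    "bls.gov": ["a.example", "b.example", "c.example"],
}
scores = propagate_scores(links, iterations=2)
```

Each extra iteration pushes scores one more hop along the link graph, which is why running the analysis more than once changes the result, and also why the cost grows with both the number of links and the number of passes.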

Nutch 0.8 uses scoring plugins, which can implement different scoring algorithms. The default one is based on OPIC, which is again a link-based quality metric - please see OPICScoringFilter for more details.
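
For intuition only, the general OPIC idea can be sketched as follows: each page holds some "cash", and when a page is visited its cash is recorded in its history and then split among its outlinks. This Python sketch is not the OPICScoringFilter implementation; the round-robin visiting order, the handling of pages without outlinks, and all names here are assumptions.

```python
def opic_estimate(links, steps):
    """Simplified OPIC-style importance estimate (illustrative only).

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    # Total cash in the system is 1.0 and is conserved throughout.
    cash = {page: 1.0 / len(pages) for page in pages}
    history = {page: 0.0 for page in pages}

    for step in range(steps):
        # Assumption: visit pages in a fixed round-robin order.
        page = pages[step % len(pages)]
        amount = cash[page]
        # Record the cash the page had collected, then hand it out.
        history[page] += amount
        cash[page] = 0.0
        # Assumption: a page without outlinks distributes to every page.
        targets = links.get(page) or pages
        for target in targets:
            cash[target] += amount / len(targets)

    # Importance is the share of all cash a page has seen so far.
    total = sum(history.values()) + sum(cash.values())
    return {page: (history[page] + cash[page]) / total for page in pages}


# Hypothetical toy graph: three agency sites linking to one central site.
site_links = {
    "a.example": ["bls.gov"],
    "b.example": ["bls.gov"],
    "c.example": ["bls.gov"],
    "bls.gov": ["a.example", "b.example", "c.example"],
}
estimate = opic_estimate(site_links, steps=40)
```

Unlike the iterative analysis above, nothing here needs a separate pass over the whole link graph: the estimate is updated page by page as pages are visited.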

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
