Bill Goffe wrote:
I'm trying to tweak the search results at my http://ese.rfe.org/ and I've got two questions (I'm running .7.2):

- In searching at the above for "unemployment" the leading results have 10 or more occurrences of that word on the page. I'd like to reduce the influence of multiple occurrences of a word on a page and give more weight to links, titles, and such. But, in looking at nutch-default.xml I don't see any obvious parameters for this. I have upped the following to these values:

  indexer.score.power 2.5
  db.score.link.external 4.0
  query.url.boost 2.0
  query.anchor.boost 2.0
  query.title.boost 2.0
  query.phrase.boost 2.0

  As the top links for unemployment are state agencies, I think I will switch db.ignore.internal.links back to true as there are more external links to where I would like users to go: http://www.bls.gov .

- In the old 0.7 tutorial, I could swear that the example suggested running "nutch analyze", but it no longer mentions that (it's not on the Internet Archive). I believe it also suggested running it more than once. Thoughts on these lines? I currently run it once after "nutch updatedb," but would more runs aid link analysis?
Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool performs a couple of iterations to propagate the scores along links. However, it is a slow and very resource-hungry process, so it was sometimes impossible to complete the analysis step even for moderately sized collections. 0.7 also offers a static ranking method, which doesn't require running the analysis step and which is based on the number of outlinks and inlinks.
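To illustrate what "a couple of iterations to propagate the scores along links" means, here is a minimal sketch of textbook PageRank-style propagation. It is not the code of the Nutch analyze tool; the function name, the damping value of 0.85, and the toy graph are all assumptions made for the example.

```python
def propagate_scores(outlinks, iterations=3, damping=0.85):
    """Textbook PageRank-style propagation (illustrative, not Nutch's code).

    outlinks maps each page to the list of pages it links to.
    Each iteration pushes a page's current score along its outlinks.
    """
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts the round with the same small base score.
        updated = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = outlinks.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # A page without outlinks spreads its score over all pages.
                share = damping * scores[page] / n
                for other in pages:
                    updated[other] += share
        scores = updated
    return scores


# Hypothetical three-page graph: "a" and "b" both link to "c", "c" links to "a".
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_scores = propagate_scores(toy_graph, iterations=50)
```

In this sketch every iteration touches every link in the graph, which gives a feel for why the step gets expensive as a collection grows.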
Nutch 0.8 uses scoring plugins, which can implement different scoring algorithms. The default one is based on OPIC, which is again a variant of link-based quality metrics - please see OPICScoringFilter for more details.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
