We ran into a similar issue a while ago, and it appeared that Nutch
was using the "boost" field in the Lucene index as part of the field
norm. It also appeared that the boost was being set by the OPIC
scoring filter, which is similar to PageRank. In our case the site was
a forum with a lot of "navigational links", and these gave very high
boosts to pages with a lot of navigation, pushing "content" pages very
far down in the search results. Things started looking much better
when we wrote a custom scoring filter that ignored OPIC and set the
boost of each document to 1.0.

Note that I'm saying "appeared" quite a few times above because we
didn't trace this all the way down. We just noticed that the problem
went away when we equalized the boost field. Your mileage may vary.
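
For what it's worth, here is a rough way to check this against the
explain output quoted below. It is only a sketch, and it assumes that
Lucene stores each field norm as roughly (document boost x field boost
x lengthNorm) squeezed into a single byte, so the result is
approximate. The class and method names (FieldNormCheck, fieldWeight,
impliedBoost) are made up for illustration and are not part of Nutch
or Lucene.

```java
// Hypothetical helper, not part of Nutch or Lucene.
class FieldNormCheck {

    // fieldWeight as printed by explain(): tf * idf * fieldNorm
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    // ASSUMPTION: fieldNorm ~= boost * lengthNorm. The one-byte norm
    // encoding loses precision, so treat the result as an estimate.
    static double impliedBoost(double fieldNorm, double lengthNorm) {
        return fieldNorm / lengthNorm;
    }

    public static void main(String[] args) {
        // Values copied from the explain output and the lengthNorm debug lines.
        double idf = 6.644858;

        // Reproduce the two fieldWeight(content:nfl) lines.
        System.out.println("doc 1234 fieldWeight = "
                + fieldWeight(1.0, idf, 0.109375));
        System.out.println("doc 796  fieldWeight = "
                + fieldWeight(3.1622777, idf, 2.4414062E-4));

        // Estimate the boost that was multiplied into each norm.
        System.out.println("doc 1234 implied boost = "
                + impliedBoost(0.109375, 0.03162));
        System.out.println("doc 796  implied boost = "
                + impliedBoost(2.4414062E-4, 0.02793));
    }
}
```

With the numbers from the two documents, the implied boost comes out
near 3.46 for doc 1234 and near 0.0087 for doc 796. The lengthNorm
values are nearly the same, so the gap would have to come from the
boost rather than from the document length.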

If you want to create your own scoring filter, just clone the
standard Nutch OPICScoringFilter and change this method so that it
returns 1.0f for every document:

    public float indexerScore(Text url, Document doc, CrawlDatum dbDatum,
                              CrawlDatum fetchDatum, Parse parse,
                              Inlinks inlinks, float initScore)
        throws ScoringFilterException {
      // Ignore the OPIC score and give every document the same boost.
      return 1.0f;
    }

Hope it helps,
Jasper
On Nov 16, 2007, at 10:26 AM, Sathyam Y wrote:
I am seeing some issues with search quality for an intranet crawl I
am working on, and the problem seems to be related to the fieldNorm
value. I have two documents of comparable size, but the fieldNorm
for the 'content' field of one document is significantly lower
(2.4414062E-4).

Here is the output from explain. The second document has a much
higher termFreq but ranks lower:

0.12035767 = (MATCH) sum of:
  0.12035767 = (MATCH) weight(content:nfl in 1234), product of:
    0.1656037 = queryWeight(content:nfl), product of:
      6.644858 = idf(docFreq=5)
      0.024922082 = queryNorm
    0.7267813 = (MATCH) fieldWeight(content:nfl in 1234), product of:
      1.0 = tf(termFreq(content:nfl)=1)
      6.644858 = idf(docFreq=5)
      0.109375 = fieldNorm(field=content, doc=1234)
----------------------------------------
0.010692856 = (MATCH) sum of:
  0.0032762752 = (MATCH) weight(url:nfl^4.0 in 796), product of:
    0.73151344 = queryWeight(url:nfl^4.0), product of:
      4.0 = boost
      7.338005 = idf(docFreq=2)
      0.024922082 = queryNorm
    0.004478763 = (MATCH) fieldWeight(url:nfl in 796), product of:
      1.0 = tf(termFreq(url:nfl)=1)
      7.338005 = idf(docFreq=2)
      6.1035156E-4 = fieldNorm(field=url, doc=796)
  8.495634E-4 = (MATCH) weight(content:nfl in 796), product of:
    0.1656037 = queryWeight(content:nfl), product of:
      6.644858 = idf(docFreq=5)
      0.024922082 = queryNorm
    0.005130099 = (MATCH) fieldWeight(content:nfl in 796), product of:
      3.1622777 = tf(termFreq(content:nfl)=10)
      6.644858 = idf(docFreq=5)
      2.4414062E-4 = fieldNorm(field=content, doc=796)

I added some debug output to the indexer and found that the second
document has a lengthNorm of 0.02793 vs 0.03162 for the first
document. Why is the fieldNorm orders of magnitude lower? Are there
any other factors that affect the fieldNorm?