We ran into a similar issue a while ago, and it appeared as if Nutch was using the "boost" field in the Lucene index as the field norm. It also appeared that the boost was being set by the OPIC scoring filter, which is similar to PageRank. In our case the site was a forum with a lot of "navigational links", which gave very high boosts to pages with a lot of navigation and pushed "content" pages very far down in the search results. Things started looking much better when we wrote a custom scoring filter that ignored OPIC and set the boost of each document to 1.0.

Note that I'm saying "appeared" quite a few times above because we didn't trace this all the way down, just noticed that it went away when we equalized the boost field. Your mileage may vary.
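
For what it's worth, here is a rough sketch of why an uneven boost would explain the numbers quoted below. It assumes what Lucene's Similarity documentation describes: the norm stored at index time is document boost * field boost * lengthNorm, squeezed into a single byte. The class and method names (NormSketch, stored, impliedBoost) are made up for illustration, and the byte encoding is my reading of Lucene's SmallFloat.floatToByte315, so treat it as an assumption rather than the exact code path Nutch takes.

```java
// Hypothetical helper, not part of Nutch or Lucene.
class NormSketch {

    // Assumed norm formula: document boost * field boost * lengthNorm.
    public static float rawNorm(float docBoost, float fieldBoost, float lengthNorm) {
        return docBoost * fieldBoost * lengthNorm;
    }

    // Assumed one-byte encoding (modelled on SmallFloat.floatToByte315):
    // keeps the exponent and the top mantissa bits and truncates the rest.
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Zero or negative maps to 0; tiny positive values map to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return (byte) -1; // overflow: largest representable value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Inverse of encode(): the value that explain() would report.
    public static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // What a raw norm looks like after the round trip through one byte.
    public static float stored(float raw) {
        return decode(encode(raw));
    }

    // Boost that would be needed to turn a lengthNorm into a given fieldNorm.
    public static float impliedBoost(float fieldNorm, float lengthNorm) {
        return fieldNorm / lengthNorm;
    }

    public static void main(String[] args) {
        // lengthNorm alone (boost 1.0) for the two documents in the question.
        System.out.println(stored(rawNorm(1.0f, 1.0f, 0.03162f))); // 0.03125
        System.out.println(stored(rawNorm(1.0f, 1.0f, 0.02793f))); // 0.02734375

        // Boosts implied by the fieldNorm values that explain() reported.
        System.out.println(impliedBoost(0.109375f, 0.03162f));     // about 3.46
        System.out.println(impliedBoost(2.4414062E-4f, 0.02793f)); // about 0.0087
    }
}
```

If those assumptions hold, the two lengthNorms on their own would be stored as nearly identical values, so the gap of several hundred times between the two fieldNorms would have to come from the boost: roughly 3.5 for doc 1234 and roughly 0.009 for doc 796. The byte encoding only lets you bound the boost from below, since everything past the top mantissa bits is truncated.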

If you want to create your own scoring filter, just clone the standard Nutch OPICScoringFilter and change this method so that it returns 1.0f for every document:

        public float indexerScore(Text url, Document doc, CrawlDatum dbDatum,
                        CrawlDatum fetchDatum, Parse parse, Inlinks inlinks,
                        float initScore)
                        throws ScoringFilterException {
                // Ignore the OPIC score so that every document gets the same boost.
                return 1.0f;
        }

Hope it helps,

Jasper

On Nov 16, 2007, at 10:26 AM, Sathyam Y wrote:

I am seeing some issues with search quality for an intranet crawl I am working on, and the problem seems to be related to the fieldNorm value.

I have two documents of comparable size, but the fieldNorm for the 'content' field of one document is significantly lower (2.4414062E-4).

Here is the output from explain. The second document has a much higher termFreq but ranks lower:

   0.12035767 = (MATCH) sum of:
      0.12035767 = (MATCH) weight(content:nfl in 1234), product of:
         0.1656037 = queryWeight(content:nfl), product of:
            6.644858 = idf(docFreq=5)
            0.024922082 = queryNorm
         0.7267813 = (MATCH) fieldWeight(content:nfl in 1234), product of:
            1.0 = tf(termFreq(content:nfl)=1)
            6.644858 = idf(docFreq=5)
            0.109375 = fieldNorm(field=content, doc=1234)

----------------------------------------
   0.010692856 = (MATCH) sum of:
      0.0032762752 = (MATCH) weight(url:nfl^4.0 in 796), product of:
         0.73151344 = queryWeight(url:nfl^4.0), product of:
            4.0 = boost
            7.338005 = idf(docFreq=2)
            0.024922082 = queryNorm
         0.004478763 = (MATCH) fieldWeight(url:nfl in 796), product of:
            1.0 = tf(termFreq(url:nfl)=1)
            7.338005 = idf(docFreq=2)
            6.1035156E-4 = fieldNorm(field=url, doc=796)
      8.495634E-4 = (MATCH) weight(content:nfl in 796), product of:
         0.1656037 = queryWeight(content:nfl), product of:
            6.644858 = idf(docFreq=5)
            0.024922082 = queryNorm
         0.005130099 = (MATCH) fieldWeight(content:nfl in 796), product of:
            3.1622777 = tf(termFreq(content:nfl)=10)
            6.644858 = idf(docFreq=5)
            2.4414062E-4 = fieldNorm(field=content, doc=796)



I added some debug output to the indexer and found that the second document has a lengthNorm of 0.02793 vs. 0.03162 for the first document. Why is the fieldNorm orders of magnitude lower? Are there any other factors that impact the fieldNorm?




