Re: very low fieldnorm leading to bad results

Jasper Kamperman Fri, 16 Nov 2007 10:52:37 -0800

We ran into a similar issue a while ago, and it appeared as if Nutch was using the "boost" field in the Lucene index as the field norm. It also appeared that the boost field was being set by the OPIC scoring filter, which is like PageRank. In our case it was a forum which had a lot of "navigational links" that gave very high boosts to pages with a lot of navigation, pushing "content" pages very far down in the search results. Things started looking much better when we wrote a custom scoring filter that ignored OPIC and set the boost of each document to 1.0.

Note that I'm saying "appeared" quite a few times above because we didn't trace this all the way down, just noticed that it went away when we equalized the boost field. Your mileage may vary.
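
A rough back-of-the-envelope check on the numbers in the quoted message below points the same way. This is only a sketch: it assumes that Lucene's default Similarity stores fieldNorm as roughly boost times lengthNorm (before the norm is rounded into a single byte), and the class and method names are made up for illustration.

```java
public class ImpliedBoost {
    // Assumption: fieldNorm ~= boost * lengthNorm, so dividing the reported
    // fieldNorm by the reported lengthNorm gives an approximate boost.
    static double impliedBoost(double fieldNorm, double lengthNorm) {
        return fieldNorm / lengthNorm;
    }

    public static void main(String[] args) {
        // fieldNorm and lengthNorm values as reported in the quoted message.
        double doc1234 = impliedBoost(0.109375, 0.03162);
        double doc796 = impliedBoost(2.4414062E-4, 0.02793);

        System.out.println("implied boost, doc 1234: " + doc1234);
        System.out.println("implied boost, doc 796:  " + doc796);
        System.out.println("ratio: " + (doc1234 / doc796));
    }
}
```

The implied boosts come out at roughly 3.46 and 0.0087. If the assumption holds, the two lengthNorm values are nearly equal and almost the whole gap between the fieldNorms comes from the boost. The figures are approximate because the stored norm is rounded.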

If you want to create your own scoring filter, just clone the standard Nutch OPICScoringFilter and change this method:


        public float indexerScore(Text url, Document doc, CrawlDatum dbDatum,
                        CrawlDatum fetchDatum, Parse parse, Inlinks inlinks,
                        float initScore)
                        throws ScoringFilterException {

so that it returns 1.0f for every document.
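
A minimal sketch of what the changed method might look like. The empty classes at the top are placeholder stand-ins so that the snippet compiles on its own; in a real plugin those types come from Hadoop, Lucene and Nutch. The class name FlatScoringFilter is hypothetical, and the rest of the cloned filter is left out.

```java
// Placeholder stand-ins only, NOT the real Hadoop/Lucene/Nutch types.
class Text {}
class Document {}
class CrawlDatum {}
class Parse {}
class Inlinks {}
class ScoringFilterException extends Exception {}

// Hypothetical name; in practice this would be a copy of OPICScoringFilter
// with only indexerScore changed.
public class FlatScoringFilter {
    public float indexerScore(Text url, Document doc, CrawlDatum dbDatum,
            CrawlDatum fetchDatum, Parse parse, Inlinks inlinks,
            float initScore)
            throws ScoringFilterException {
        // Ignore the OPIC-derived initScore and give every document the
        // same boost, so link-heavy pages get no extra weight at index time.
        return 1.0f;
    }
}
```

The custom filter then has to be enabled in place of the OPIC one in your Nutch plugin configuration, and the index has to be rebuilt before the new boosts take effect.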

Hope it helps,

Jasper

On Nov 16, 2007, at 10:26 AM, Sathyam Y wrote:

I am seeing some issues with search quality for an intranet crawl I am working on, and the problem seems to be related to the fieldNorm value.
I have two documents of comparable size, but the fieldNorm for the 'content' field of one document is significantly lower (2.4414062E-4).
Here is the output from explain. The second document has a much higher termFreq but ranks lower:
   0.12035767 = (MATCH) sum of:
      0.12035767 = (MATCH) weight(content:nfl in 1234), product of:
         0.1656037 = queryWeight(content:nfl), product of:
            6.644858 = idf(docFreq=5)
            0.024922082 = queryNorm
         0.7267813 = (MATCH) fieldWeight(content:nfl in 1234), product of:
            1.0 = tf(termFreq(content:nfl)=1)
            6.644858 = idf(docFreq=5)
            0.109375 = fieldNorm(field=content, doc=1234)

----------------------------------------
   0.010692856 = (MATCH) sum of:
      0.0032762752 = (MATCH) weight(url:nfl^4.0 in 796), product of:
         0.73151344 = queryWeight(url:nfl^4.0), product of:
            4.0 = boost
            7.338005 = idf(docFreq=2)
            0.024922082 = queryNorm
         0.004478763 = (MATCH) fieldWeight(url:nfl in 796), product of:
            1.0 = tf(termFreq(url:nfl)=1)
            7.338005 = idf(docFreq=2)
            6.1035156E-4 = fieldNorm(field=url, doc=796)
      8.495634E-4 = (MATCH) weight(content:nfl in 796), product of:
         0.1656037 = queryWeight(content:nfl), product of:
            6.644858 = idf(docFreq=5)
            0.024922082 = queryNorm
         0.005130099 = (MATCH) fieldWeight(content:nfl in 796), product of:
            3.1622777 = tf(termFreq(content:nfl)=10)
            6.644858 = idf(docFreq=5)
            2.4414062E-4 = fieldNorm(field=content, doc=796)
I added some debug output to the indexer and found that the second document has a lengthNorm of 0.02793 vs. 0.03162 for the first document. Why is the fieldNorm orders of magnitude lower? Are there any other factors that affect the fieldNorm?