Re: Nutch scoring algorithm

Kannan Sundaramoorthy Mon, 11 Apr 2005 22:51:20 -0700

Hi,
Thanks for the explanation. I need some more information. I understand
that fieldNorm is a byte-encoded normalization factor for the named
field of every document, and that this value is returned by
norms(String field) of the SegmentReader class. Is this normalization
factor calculated at index time and simply read back during searching?
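
To make "byte-encoded" concrete, here is a minimal sketch of how a float
norm can be packed into a single byte and unpacked again. It assumes a
5-bit exponent / 3-bit mantissa layout, which is my understanding of the
scheme Lucene's Similarity used at the time; the class and method names
(NormCodec, encode, decode) are hypothetical and this is not Lucene's
source code.

```java
// Hypothetical sketch: pack a float norm into one byte (5-bit exponent,
// 3-bit mantissa) and unpack it again. The layout is an assumption.
final class NormCodec {
    static byte encode(float f) {
        if (f <= 0.0f) return 0;                       // zero and negatives map to 0
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;        // top 3 of the low 24 bits
        int exponent = (((bits >> 24) & 0x7f) - 63) + 15;
        if (exponent > 31) { exponent = 31; mantissa = 7; }  // clamp overflow
        if (exponent < 0)  { exponent = 0;  mantissa = 1; }  // clamp underflow
        return (byte) ((exponent << 3) | mantissa);
    }

    static float decode(byte b) {
        if (b == 0) return 0.0f;
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // The first three samples are the fieldNorm values from the explain output.
        float[] samples = {0.25f, 0.875f, 0.0546875f, 0.3f};
        for (float s : samples) {
            System.out.println(s + " -> " + decode(encode(s)));
        }
    }
}
```

Under this layout the three fieldNorm values in the explain output
(0.25, 0.875, 0.0546875) survive the round trip unchanged, while a value
such as 0.3 comes back as 0.25, which is the kind of precision loss a
one-byte norm implies.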


I used explain to see the boosts for different fields; please see the
details below. According to the explanation, the "url" field is
assigned a boost of 4.0 and the "anchor" field a boost of 2.0. Please
suggest how I can alter the boost values for different fields. Does
this need a configuration change at indexing time?

        page

        * docNo = 1
        * segment = 20050411183746
        * digest = 3835653251e4598bee61618b1c64804c
        * boost = 1.8572323
        * lastModified = 1113224620000
        * contentLength = 347
        * primaryType = text
        * subType = html
        * url = http://localhost:8080/none.html
        * title = None Document

        score for query: none

        * 1.5042199 = sum of:
                o 0.4181689 = weight(url:none^4.0 in 1), product of:
                        + 0.8728715 = queryWeight(url:none^4.0), product of:
                        # 4.0 = boost
                        # 1.9162908 = idf(docFreq=1)
                        # 0.11387514 = queryNorm
                        + 0.4790727 = fieldWeight(url:none in 1), product of:
                        # 1.0 = tf(termFreq(url:none)=1)
                        # 1.9162908 = idf(docFreq=1)
                        # 0.25 = fieldNorm(field=url, doc=1)
                o 1.0349152 = weight(anchor:none^2.0 in 1), product of:
                        + 0.43643576 = queryWeight(anchor:none^2.0), product of:
                        # 2.0 = boost
                        # 1.9162908 = idf(docFreq=1)
                        # 0.11387514 = queryNorm
                        + 2.3712888 = fieldWeight(anchor:none in 1), product of:
                        # 1.4142135 = tf(termFreq(anchor:none)=2)
                        # 1.9162908 = idf(docFreq=1)
                        # 0.875 = fieldNorm(field=anchor, doc=1)
                o 0.05113577 = weight(content:none in 1), product of:
                        + 0.21821788 = queryWeight(content:none), product of:
                        # 1.9162908 = idf(docFreq=1)
                        # 0.11387514 = queryNorm
                        + 0.23433356 = fieldWeight(content:none in 1), product of:
                        # 2.236068 = tf(termFreq(content:none)=5)
                        # 1.9162908 = idf(docFreq=1)
                        # 0.0546875 = fieldNorm(field=content, doc=1)
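
As a cross-check, the tree above can be recomputed from its leaf values.
The sketch below uses only the numbers printed by explain; the class and
method names (ExplainCheck, weight, total, tf, queryNorm) are
hypothetical. The tf and queryNorm formulas are assumptions on my part
(square root of the term frequency, and one over the square root of the
summed squares of boost * idf); they appear to reproduce the printed
values but are not taken from the Nutch source.

```java
// Hypothetical sketch: recompute the explain tree for doc 1, query "none".
final class ExplainCheck {
    // Leaf values copied from the explain output.
    static final double IDF = 1.9162908;
    static final double QUERY_NORM = 0.11387514;

    // weight = queryWeight * fieldWeight, with
    // queryWeight = boost * idf * queryNorm and
    // fieldWeight = tf * idf * fieldNorm, as laid out in the tree.
    static double weight(double boost, double tf, double fieldNorm) {
        double queryWeight = boost * IDF * QUERY_NORM;
        double fieldWeight = tf * IDF * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Sum of the three per-field weights (url, anchor, content).
    static double total() {
        double url = weight(4.0, 1.0, 0.25);
        double anchor = weight(2.0, 1.4142135, 0.875);
        double content = weight(1.0, 2.236068, 0.0546875); // no boost shown, so 1.0
        return url + anchor + content;
    }

    // Assumption: tf is the square root of the raw term frequency.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // Assumption: queryNorm = 1 / sqrt(sum over clauses of (boost * idf)^2).
    static double queryNorm(double... boosts) {
        double sumOfSquares = 0.0;
        for (double b : boosts) {
            sumOfSquares += (b * IDF) * (b * IDF);
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        System.out.println("url       = " + weight(4.0, 1.0, 0.25));
        System.out.println("anchor    = " + weight(2.0, 1.4142135, 0.875));
        System.out.println("content   = " + weight(1.0, 2.236068, 0.0546875));
        System.out.println("total     = " + total());
        System.out.println("queryNorm = " + queryNorm(4.0, 2.0, 1.0));
    }
}
```

Since the boost enters only through queryWeight, changing 4.0 or 2.0 in
this sketch shows how much each field's contribution to the total would
move, although queryNorm also shifts with the boosts if the assumed
formula holds.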

Thanks,
Kannan
On Mon, 2005-04-11 at 17:47, Andy Liu wrote:
> fieldNorm is lengthNorm * document boost.  The lengthNorm formula (a
> function of the number of terms in the field) is defined in Lucene's
> Similarity class, and the document boost is calculated in
> IndexSegment.java.
> 
> Nutch assigns different boosts to each field so that you can tune your
> search results.  For example, you can use explain to see if anchor
> matches are too strong, and adjust accordingly.
> 
> Andy
> 
> On Apr 11, 2005 12:17 AM, Kannan Sundaramoorthy
> <[EMAIL PROTECTED]> wrote:
> > 
> > Hi,
> > I am trying to understand how Nutch computes the score for each
> > document. I could figure out how tf, idf and queryNorm are computed,
> > but I do not understand how the fieldNorm (per-field normalisation)
> > value is computed. It seems to be a magic number to me, and this is
> > where Nutch seems to differ from Lucene in computing the score.
> > Also, Nutch assigns different boosts to different fields (e.g., 4.0
> > for the url field) and uses this value while computing queryWeight.
> > Can anyone explain these, please?
> > 
> > Thanks,
> > Kannan
> > 
> > This e-mail and any files transmitted with it are for the sole use of the 
> > intended recipient(s) and may contain confidential and privileged 
> > information.
> > If you are not the intended recipient, please contact the sender by reply 
> > e-mail and destroy all copies of the original message.
> > Any unauthorised review, use, disclosure, dissemination, forwarding, 
> > printing or copying of this email or any action taken in reliance on this 
> > e-mail is strictly
> > prohibited and may be unlawful.
> > 
> >   Visit us at http://www.cognizant.com
> >


