Re: Nutch scoring algorithm

Andy Liu Tue, 12 Apr 2005 05:40:39 -0700

You're correct: the fieldNorm is calculated at index time.
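
A rough Python sketch of why the stored values look so "round" (0.25,
0.875, 0.0546875 in the explain output quoted below). It assumes the
norm is squeezed into a single byte with a 3-bit mantissa and a 5-bit
exponent, which is how Lucene's Similarity.encodeNorm is usually
described. The exact bit layout and the names encode_norm, decode_norm
and quantize are assumptions for illustration, so check them against
the Lucene source before relying on them.

```python
import struct


def encode_norm(value):
    """Pack a positive float into one byte (assumed 3-bit mantissa, 5-bit exponent)."""
    if value <= 0.0:
        return 0
    # Reinterpret the value as the 32 bits of an IEEE-754 single-precision float.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    mantissa = (bits & 0xFFFFFF) >> 21
    exponent = ((bits >> 24) & 0x7F) - 63 + 15
    if exponent > 31:   # too large: clamp to the biggest representable value
        exponent, mantissa = 31, 7
    if exponent < 0:    # too small: clamp to the smallest positive value
        exponent, mantissa = 0, 1
    return (exponent << 3) | mantissa


def decode_norm(byte):
    """Expand the byte back into a float; this is what search time would read."""
    if byte == 0:
        return 0.0
    mantissa = byte & 7
    exponent = (byte >> 3) & 31
    bits = ((exponent + 63 - 15) << 24) | (mantissa << 21)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def quantize(value):
    """Round trip: the value as it would come back out of the index."""
    return decode_norm(encode_norm(value))


# fieldNorm values copied from the explain output, plus one hypothetical raw value.
for raw in (0.25, 0.875, 0.0546875, 0.3):
    print(raw, "->", quantize(raw))
```

Under that assumption the three fieldNorm values from the explain output
survive the round trip unchanged, while an in-between value such as 0.3
(hypothetical) comes back as 0.25. That loss of precision happens once,
at index time.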

The URL and anchor boosts are query-time boosts that can be changed
within the query-basic plugin.  There's been talk about moving these
out of the code and into the conf file, but I'm not sure if it's been
done yet.
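
To make the index-time/query-time split concrete, here is a small Python
sketch that re-derives the numbers in the explain output quoted below.
The formulas (tf = sqrt(termFreq), queryNorm = 1 / sqrt(sum of squared
boost * idf)) are Lucene's default Similarity as far as these numbers
show; the function and variable names are made up for illustration. The
boosts are the only query-time inputs; fieldNorm is taken as stored in
the index.

```python
import math

IDF = 1.9162908  # idf(docFreq=1), copied from the explain output

# field: (query-time boost, termFreq, index-time fieldNorm), all from the explain output
CLAUSES = {
    "url": (4.0, 1, 0.25),
    "anchor": (2.0, 2, 0.875),
    "content": (1.0, 5, 0.0546875),
}


def query_norm(clauses, idf):
    """1 / sqrt(sum of squared (boost * idf)) over all query clauses."""
    return 1.0 / math.sqrt(sum((boost * idf) ** 2 for boost, _, _ in clauses.values()))


def clause_score(boost, term_freq, field_norm, idf, qnorm):
    """weight = queryWeight * fieldWeight, as in the explain tree."""
    query_weight = boost * idf * qnorm                        # query-time part
    field_weight = math.sqrt(term_freq) * idf * field_norm    # index-time part
    return query_weight * field_weight


qnorm = query_norm(CLAUSES, IDF)
scores = {
    field: clause_score(boost, freq, norm, IDF, qnorm)
    for field, (boost, freq, norm) in CLAUSES.items()
}
total = sum(scores.values())

for field, score in scores.items():
    print(f"{field:8s} {score:.7f}")
print(f"{'total':8s} {total:.7f}")
```

Changing a boost in CLAUSES changes both that clause's queryWeight and
the shared queryNorm, but nothing in the index needs to be rebuilt.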


On Apr 12, 2005 1:50 AM, Kannan Sundaramoorthy
<[EMAIL PROTECTED]> wrote:
> 
> Hi,
> Thanks for the explanation. I need some more info. I understand that
> fieldNorm is a byte-encoded normalization factor for the named field of
> every document. This value is returned by norms(String field) of the
> SegmentReader class. Is this normalization factor calculated at index
> time and simply read back during searching?
> 
> I used explain to see the boosts for different fields.  Please see the
> details below. As I see from the explanation, the "url" field is assigned
> a boost of 4.0 and the "anchor" field is assigned a boost of 2.0. Please
> suggest how I can alter the boost values for different fields. Does it
> need any configuration change at indexing time?
> 
>         page
> 
>         * docNo = 1
>         * segment = 20050411183746
>         * digest = 3835653251e4598bee61618b1c64804c
>         * boost = 1.8572323
>         * lastModified = 1113224620000
>         * contentLength = 347
>         * primaryType = text
>         * subType = html
>         * url = http://localhost:8080/none.html
>         * title = None Document
> 
>         score for query: none
> 
>         * 1.5042199 = sum of:
>                 o 0.4181689 = weight(url:none^4.0 in 1), product of:
>                         + 0.8728715 = queryWeight(url:none^4.0), product of:
>                         # 4.0 = boost
>                         # 1.9162908 = idf(docFreq=1)
>                         # 0.11387514 = queryNorm
>                         + 0.4790727 = fieldWeight(url:none in 1), product of:
>                         # 1.0 = tf(termFreq(url:none)=1)
>                         # 1.9162908 = idf(docFreq=1)
>                         # 0.25 = fieldNorm(field=url, doc=1)
>                 o 1.0349152 = weight(anchor:none^2.0 in 1), product of:
>                         + 0.43643576 = queryWeight(anchor:none^2.0), product of:
>                         # 2.0 = boost
>                         # 1.9162908 = idf(docFreq=1)
>                         # 0.11387514 = queryNorm
>                         + 2.3712888 = fieldWeight(anchor:none in 1), product of:
>                         # 1.4142135 = tf(termFreq(anchor:none)=2)
>                         # 1.9162908 = idf(docFreq=1)
>                         # 0.875 = fieldNorm(field=anchor, doc=1)
>                 o 0.05113577 = weight(content:none in 1), product of:
>                         + 0.21821788 = queryWeight(content:none), product of:
>                         # 1.9162908 = idf(docFreq=1)
>                         # 0.11387514 = queryNorm
>                         + 0.23433356 = fieldWeight(content:none in 1), product of:
>                         # 2.236068 = tf(termFreq(content:none)=5)
>                         # 1.9162908 = idf(docFreq=1)
>                         # 0.0546875 = fieldNorm(field=content, doc=1)
> 
> Thanks,
> Kannan
> On Mon, 2005-04-11 at 17:47, Andy Liu wrote:
> > fieldNorm is lengthNorm * document boost.  The lengthNorm formula
> > (a function of the number of terms in the field) is defined in
> > Lucene's Similarity class, and the document boost is calculated in
> > IndexSegment.java.
> >
> > Nutch assigns different boosts to each field so that you can tune your
> > search results.  For example, you can use explain to see if anchor
> > matches are too strong, and adjust accordingly.
> >
> > Andy
> >
> > On Apr 11, 2005 12:17 AM, Kannan Sundaramoorthy
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > > I am trying to understand how Nutch computes the score for each
> > > document. I could figure out how tf, idf and queryNorm are computed,
> > > but I do not understand how the fieldNorm (normalisation for each
> > > field) value is computed. It seems to be a magic number to me, and
> > > this is where Nutch seems to differ from Lucene in computing the score.
> > > Also, Nutch assigns different boosts to different fields (e.g., 4.0 for
> > > the url field) and uses this value while computing queryWeight. Can
> > > anyone explain these, please?
> > >
> > > Thanks,
> > > Kannan
> > >
> > > This e-mail and any files transmitted with it are for the sole use of the 
> > > intended recipient(s) and may contain confidential and privileged 
> > > information.
> > > If you are not the intended recipient, please contact the sender by reply 
> > > e-mail and destroy all copies of the original message.
> > > Any unauthorised review, use, disclosure, dissemination, forwarding, 
> > > printing or copying of this email or any action taken in reliance on this 
> > > e-mail is strictly
> > > prohibited and may be unlawful.
> > >
> > >   Visit us at http://www.cognizant.com
> > >
> 
