> -----Original Message-----
> From: Peter Carlson [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, April 11, 2002 4:35 PM
> To: Lucene Users List
> Subject: Re: Normalization of Documents
>
>
> Hi,
>
> These types of questions/discussions should be on the users
> list, not dev
> list, please.
>
OK
>
> Just for the record, Lucene's scoring is not as simple as just a %.
> From the FAQ:
>
> For the record, Lucene's scoring algorithm is, roughly:
>
> score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
What I would like:

score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) * p_value_d

where:
p_value_d : a predefined value for the document, set at indexing time (0 < p_value_d <= 1)
In the API:

option 1:
  writer = new IndexWriter(..);
  writer.addDocument(doc, 0.45);

option 2 (I think this is better):
  Document d = new Document();
  d.setValue(0.45);       // the proposed per-document weight
  d.add(..);
  writer.addDocument(d);
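
Until something like setValue() exists, a possible stopgap (just a sketch,
nothing tested; the field name "pvalue" is made up) is to store the weight
as an ordinary keyword field at indexing time and fold it into the ranking
yourself at search time:

// Sketch only: stores the proposed p_value_d as an ordinary stored field,
// since there is no per-document boost API today.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexWithWeight {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        Document d = new Document();
        d.add(Field.Keyword("pvalue", "0.45"));                   // stored, untokenized weight
        d.add(Field.Text("contents", "some document text ..."));  // the normal indexed body
        writer.addDocument(d);

        writer.close();
    }
}

The search-time half of this workaround is sketched further down.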
peter
>
> where:
> score_d : score for document d
> sum_t : sum for all terms t
> tf_q : the square root of the frequency of t in the query
> tf_d : the square root of the frequency of t in d
> idf_t : log(numDocs/(docFreq_t+1)) + 1.0
> numDocs : number of documents in index
> docFreq_t : number of documents containing t
> norm_q : sqrt(sum_t((tf_q*idf_t)^2))
> norm_d_t : square root of number of tokens in d in the
> same field as t
>
> (I hope that's right!)
>
> [Doug later added...]
>
> Make that:
>
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
>
> where
>
> boost_t : the user-specified boost for term t
> coord_q_d : number of terms in both query and document / number of terms in query
>
> The coordination factor gives an AND-like boost to documents
> that contain,
> e.g., all three terms in a three word query over those that
> contain just two
> of the words.
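
Just to make sure I read the formula right, here is a literal transcription
of it in plain Java (illustrative only; this is not Lucene's actual code,
and the one-array-entry-per-term layout is my own):

// A literal transcription of the formula quoted above, for illustration
// only; this is not Lucene's source code.
public class ScoreSketch {

    // idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
    static double idf(int numDocs, int docFreq) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // One array entry per query term t.
    static double score(double[] tfQ,     // sqrt of frequency of t in the query
                        double[] tfD,     // sqrt of frequency of t in the document
                        int[] docFreq,    // number of documents containing t
                        double[] normDT,  // sqrt of number of tokens in d in t's field
                        double[] boost,   // user-specified boost for t
                        int numDocs,      // number of documents in the index
                        int termsInBoth,  // terms appearing in both query and document
                        int termsInQuery) {

        // norm_q = sqrt(sum_t((tf_q * idf_t)^2))
        double normQ = 0.0;
        for (int t = 0; t < tfQ.length; t++) {
            double x = tfQ[t] * idf(numDocs, docFreq[t]);
            normQ += x * x;
        }
        normQ = Math.sqrt(normQ);

        // sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
        double sum = 0.0;
        for (int t = 0; t < tfQ.length; t++) {
            double idfT = idf(numDocs, docFreq[t]);
            sum += tfQ[t] * idfT / normQ * tfD[t] * idfT / normDT[t] * boost[t];
        }

        // coord_q_d = terms in both query and document / terms in query
        double coord = termsInBoth / (double) termsInQuery;
        return sum * coord;
    }
}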
>
>
>
> Although this may still not be what you want, you should be able to
> replace the scoring mechanism with your own. The problem you might run
> into is that retrieving the document data (such as a date) will slow
> down your search speed dramatically.
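
For example, a crude search-time workaround is to re-weight the returned
hits by a value stored in each document (the made-up "pvalue" field from
the indexing sketch above). Every hits.doc(i) call loads the stored
document, which is exactly the slowdown mentioned above:

// Sketch only: re-weights hits by a value stored in each document.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RescoreByStoredValue {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");
        Query query = QueryParser.parse("normalization", "contents", new StandardAnalyzer());

        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);                          // loads the stored document
            float pValue = Float.parseFloat(doc.get("pvalue"));  // made-up field, see above
            float adjusted = hits.score(i) * pValue;             // score_d * p_value_d
            System.out.println(adjusted + "\t" + doc.get("contents"));
        }
        searcher.close();
    }
}

The hits would of course still have to be re-sorted by the adjusted score,
which only adds to the cost on large result sets.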
>
> Do you know of any solutions (academic or free) that provide this
> concept extraction? I've heard of a group in the UK who worked on
> something like this.
>
> --Peter
>
>
>
> On 4/11/02 6:51 AM, "Halácsy Péter" <[EMAIL PROTECTED]> wrote:
>
> > Concept extraction is not an easy thing, and I don't think you can
> > implement a solution that is independent of language, context, and
> > document type. Filtering only the important terms of a text (rather
> > than indexing all of the text, as modern full-text indexing systems
> > do) is one of the most important areas of IR. A lot of projects
> > worked on this topic, but nowadays it is less important because we
> > can index every term if we want (cheaper and faster disks, lots of CPU).
> >
> > I think that in Lucene the term's % of the document
> > (NUMBER_OF_WORDS_IN_THE_DOCUMENT / NUMBER_OF_QUERY_TERM_OCCURRENCES)
> > is overweighted in some cases. I would like to tune it if I could.
> >
> > Document scoring could provide a solution for me and, I think, for
> > Melissa as well. I think it's a very important feature of a modern IR
> > system. For example, Melissa would use it to score documents based on
> > link popularity (or impact factor/citation frequency). In my project I
> > need to score documents by their length and their age (a more recent
> > document is more valuable, and in my archive very old documents are as
> > valuable as very new ones).
> >
> > peter
> >
> >> -----Original Message-----
> >> From: Peter Carlson [mailto:[EMAIL PROTECTED]]
> >> Sent: Wednesday, April 10, 2002 5:17 PM
> >> To: Lucene Developers List
> >> Subject: Re: Normalization of Documents
> >>
> >>
> >> I have noticed the same issue.
> >>
> >> From what I understand, this is both the way it should work and a
> >> problem. Shorter documents that contain a given term should be more
> >> relevant, because more of the document is about that term (i.e., the
> >> term makes up a greater % of the document). However, when the
> >> documents are of completely different sizes (e.g., 20 words vs. 2000
> >> words), this assumption doesn't hold up very well.
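
To put rough numbers on that (sizes made up, using the
norm_d_t = sqrt(number of tokens) definition quoted above):

  one occurrence in a 20-word document:    tf_d / norm_d_t = sqrt(1) / sqrt(20)   ~= 0.22
  one occurrence in a 2000-word document:  tf_d / norm_d_t = sqrt(1) / sqrt(2000) ~= 0.02

so, all else being equal, length alone gives the short document roughly a
10x larger per-term contribution.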
> >>
> >> One solution I've heard of is to extract the concepts of the
> >> documents, then search on those. The concepts are still difficult to
> >> extract if the document is too short, but it may provide a way to
> >> standardize documents. I have been lazily looking for an open source,
> >> academic concept extractor, but I haven't found one. There are
> >> companies like Semio and ActiveNavigation which provide this service
> >> for a fee.
> >>
> >> Let me know if you find anything or have other ideas.
> >>
> >> --Peter
> >>
> >>
> >> On 4/9/02 10:07 PM, "Melissa Mifsud" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi,
> >>>
> >>> Documents which are shorter in length always seem to score higher in
> >>> Lucene. I was under the impression that the normalization factors in
> >>> the scoring function used by Lucene would address this; however,
> >>> after a couple of experiments, the short documents still always score
> >>> the highest.
> >>>
> >>> Does anyone have any ideas as to how it is possible to make
> >> lengthier
> >>> documents score higher?
> >>>
> >>> Also, I would like a way to boost documents according to the number
> >>> of in-links a document has.
> >>>
> >>> Has anyone implemented a type of Document.setBoost() method?
> >>>
> >>> I found a thread in the lucene-dev mailing list where Doug Cutting
> >>> mentions that this would be a great feature to add to Lucene. No one
> >>> followed up on his email.
> >>>
> >>> Melissa.
> >>>
> >>
> >>