> -----Original Message-----
> From: Peter Carlson [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, April 11, 2002 4:35 PM
> To: Lucene Users List
> Subject: Re: Normalization of Documents
>
>
> Hi,
>
> These types of questions/discussions should be on the users
> list, not dev
> list, please.
>
OK
>
> Just for the record, Lucene's scoring is not as simple as just a %.
> From the FAQ:
>
> For the record, Lucene's scoring algorithm is, roughly:
>
> score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
What I would like:

score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) * p_value_d

where:
p_value_d : a predefined value for the document, set at indexing time (0 < p_value_d <= 1)
In the API:

option 1:
  writer = new IndexWriter(..);
  writer.addDocument(doc, 0.45);

option 2 (I think this is better):
  Document d = new Document();
  d.setValue(0.45);       // the proposed per-document weight
  d.add(..);
  writer.addDocument(d);
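
Until something like setValue() exists, a possible stopgap (just a sketch,
nothing tested; the field name "pvalue" is made up) is to store the weight
as an ordinary keyword field at indexing time and fold it into the ranking
yourself at search time:

// Sketch only: stores the proposed p_value_d as an ordinary stored field,
// since there is no per-document boost API today.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexWithWeight {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        Document d = new Document();
        d.add(Field.Keyword("pvalue", "0.45"));                   // stored, untokenized weight
        d.add(Field.Text("contents", "some document text ..."));  // the normal indexed body
        writer.addDocument(d);

        writer.close();
    }
}

The search-time half of this workaround is sketched further down.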
peter
>
> where:
> score_d : score for document d
> sum_t : sum for all terms t
> tf_q : the square root of the frequency of t in the query
> tf_d : the square root of the frequency of t in d
> idf_t : log(numDocs/(docFreq_t+1)) + 1.0
> numDocs : number of documents in index
> docFreq_t : number of documents containing t
> norm_q : sqrt(sum_t((tf_q*idf_t)^2))
> norm_d_t : square root of number of tokens in d in the
> same field as t
>
> (I hope that's right!)
>
> [Doug later added...]
>
> Make that:
>
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
>
> where
>
> boost_t : the user-specified boost for term t
> coord_q_d : number of terms in both query and document / number of terms in query
>
> The coordination factor gives an AND-like boost to documents
> that contain,
> e.g., all three terms in a three word query over those that
> contain just two
> of the words.
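
Just to make sure I read the formula right, here is a literal transcription
of it in plain Java (illustrative only; this is not Lucene's actual code,
and the one-array-entry-per-term layout is my own):

// A literal transcription of the formula quoted above, for illustration
// only; this is not Lucene's source code.
public class ScoreSketch {

    // idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
    static double idf(int numDocs, int docFreq) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // One array entry per query term t.
    static double score(double[] tfQ,     // sqrt of frequency of t in the query
                        double[] tfD,     // sqrt of frequency of t in the document
                        int[] docFreq,    // number of documents containing t
                        double[] normDT,  // sqrt of number of tokens in d in t's field
                        double[] boost,   // user-specified boost for t
                        int numDocs,      // number of documents in the index
                        int termsInBoth,  // terms appearing in both query and document
                        int termsInQuery) {

        // norm_q = sqrt(sum_t((tf_q * idf_t)^2))
        double normQ = 0.0;
        for (int t = 0; t < tfQ.length; t++) {
            double x = tfQ[t] * idf(numDocs, docFreq[t]);
            normQ += x * x;
        }
        normQ = Math.sqrt(normQ);

        // sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
        double sum = 0.0;
        for (int t = 0; t < tfQ.length; t++) {
            double idfT = idf(numDocs, docFreq[t]);
            sum += tfQ[t] * idfT / normQ * tfD[t] * idfT / normDT[t] * boost[t];
        }

        // coord_q_d = terms in both query and document / terms in query
        double coord = termsInBoth / (double) termsInQuery;
        return sum * coord;
    }
}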
>
>
>
> Although this may still not be what you want, you should be able to
> replace the scoring mechanism with your own. The problem you might run
> into is that retrieving the document data (such as a date) will slow
> down your search speed dramatically.
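
For example, a crude search-time workaround is to re-weight the returned
hits by a value stored in each document (the made-up "pvalue" field from
the indexing sketch above). Every hits.doc(i) call loads the stored
document, which is exactly the slowdown mentioned above:

// Sketch only: re-weights hits by a value stored in each document.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RescoreByStoredValue {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");
        Query query = QueryParser.parse("normalization", "contents", new StandardAnalyzer());

        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);                          // loads the stored document
            float pValue = Float.parseFloat(doc.get("pvalue"));  // made-up field, see above
            float adjusted = hits.score(i) * pValue;             // score_d * p_value_d
            System.out.println(adjusted + "\t" + doc.get("contents"));
        }
        searcher.close();
    }
}

The hits would of course still have to be re-sorted by the adjusted score,
which only adds to the cost on large result sets.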
>
> Do you know of any solutions (academic or free) that provide this
> concept extraction? I've heard of a group in the UK who worked on
> something like this.
>
> --Peter
>
>
>
> On 4/11/02 6:51 AM, "Halácsy Péter" <[EMAIL PROTECTED]> wrote:
>
> > Concept extraction is not an easy thing, and I don't think you can
> > implement a solution that is independent of language, context, and
> > document type. Filtering only the important terms of a text (rather
> > than indexing all of the text, as modern full-text indexing systems
> > do) is one of the most important areas of IR. A lot of projects
> > worked on this topic, but nowadays it is less important because we
> > can index every term if we want (cheaper and faster disks, lots of CPU).
> >
> > I think that in Lucene the term's % of the document
> > (NUMBER_OF_WORDS_IN_THE_DOCUMENT / NUMBER_OF_QUERY_TERM_OCCURRENCES)
> > is overweighted in some cases. I would like to tune it if I could.
> >
> > Document scoring could provide a solution for me and, I think, for
> > Melissa as well. I think it's a very important feature of a modern IR
> > system. For example, Melissa would use it to score documents based on
> > link popularity (or impact factor/citation frequency). In my project I
> > need to score documents by their length and their age (a more recent
> > document is more valuable, and in my archive very old documents are as
> > valuable as very new ones).
> >
> > peter
> >
> >> -----Original Message-----
> >> From: Peter Carlson [mailto:[EMAIL PROTECTED]]
> >> Sent: Wednesday, April 10, 2002 5:17 PM
> >> To: Lucene Developers List
> >> Subject: Re: Normalization of Documents
> >>
> >>
> >> I have noticed the same issue.
> >>
> >> From what I understand, this is both the way it should work and a
> >> problem. Shorter documents that contain a given term should be more
> >> relevant, because more of the document is about that term (i.e., the
> >> term makes up a greater % of the document). However, when the
> >> documents are of completely different sizes (e.g., 20 words vs. 2000
> >> words), this assumption doesn't hold up very well.
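
To put rough numbers on that (sizes made up, using the
norm_d_t = sqrt(number of tokens) definition quoted above):

  one occurrence in a 20-word document:    tf_d / norm_d_t = sqrt(1) / sqrt(20)   ~= 0.22
  one occurrence in a 2000-word document:  tf_d / norm_d_t = sqrt(1) / sqrt(2000) ~= 0.02

so, all else being equal, length alone gives the short document roughly a
10x larger per-term contribution.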
> >>
> >> One solution I've heard of is to extract the concepts of the
> >> documents, then search on those. The concepts are still difficult to
> >> extract if the document is too short, but it may provide a way to
> >> standardize documents. I have been lazily looking for an open source,
> >> academic concept extractor, but I haven't found one. There are
> >> companies like Semio and ActiveNavigation which provide this service
> >> for a fee.
> >>
> >> Let me know if you find anything or have other ideas.
> >>
> >> --Peter
> >>
> >>
> >> On 4/9/02 10:07 PM, "Melissa Mifsud" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi,
> >>>
> >>> Documents which are shorter in length always seem to score higher in
> >>> Lucene. I was under the impression that the normalization factors in
> >>> the scoring function used by Lucene would address this; however,
> >>> after a couple of experiments, the short documents still always score
> >>> the highest.
> >>>
> >>> Does anyone have any ideas as to how it is possible to make
> >> lengthier
> >>> documents score higher?
> >>>
> >>> Also, I would like a way to boost documents according to the number
> >>> of in-links a document has.
> >>>
> >>> Has anyone implemented a type of Document.setBoost() method?
> >>>
> >>> I found a thread in the lucene-dev mailing list where Doug Cutting
> >>> mentions that this would be a great feature to add to Lucene. No one
> >>> followed up on his email.
> >>>
> >>> Melissa.
> >>>
> >>
> >>