Chuck,

I'm working on a more structured query language with the usual explicit boolean and distance operators.
For the scoring I'm still mostly in the design stage. I'm considering dropping the idf altogether, for several reasons. One reason is the issue of very low frequency terms (typos in the full text of articles and PDFs, for example) that get into queries via prefix terms and fuzzy terms. This is also occasionally reported in the IR literature, but I don't have a reference handy. Another reason is that the users of a structured query language have their own ideas about the importance of their terms: they take the trouble to relate their terms with the operators, and the concept of idf doesn't seem to fit in there. Also, I'm not aware of research results on the combination of idf, structured query languages and full-text search. I have the impression that idf works well for text-like queries on abstracts, and that document length and document term frequency matter more than idf when searching full text. Finally, in my case a classification-based restriction is typically applied that reduces the searched documents to about 0.5% - 1.5% of the total number of documents available, which leaves the idf less accurate.

Any comments?

...

> However, Lucene is a very good search engine and it seems right that
> it would have a "best of class" scoring formula out of the box.

I fully agree. It's also a very good environment for designing and implementing another query language, although that by itself turns out to be harder than I anticipated...

Regards,
Paul Elschot

On Sunday 17 October 2004 02:22, Chuck Williams wrote:
> Doug Cutting wrote:
> > If someone can demonstrate that an alternate formulation produces
> > superior results for most applications, then we should of course
> > change the default implementation. But just noting that there's a
> > factor which is equal to idf^2 in each element of the sum does not
> > do this.
>
> I researched the idf^2 issue further and believe that empirical
> studies have consistently concluded that one idf factor should be
> dropped.
> Salton, the originator of the IR vector space model, decided to drop
> the idf term on documents in order to avoid the squaring. I hear he
> did this after studying recall and precision for many variations of
> his formula. Here is a quote from his TREC-3 paper, which references
> the same comment in his TREC-1 and TREC-2 papers. Note the final
> sentence:
>
> ...
>
> "To allow a meaningful final retrieval similarity, it is convenient
> to use a length normalization factor as part of the term weighting
> formula. A high-quality term weighting formula for wik, the weight
> of term Tk in query Qi is
>
>   wik = [ (log(fik) + 1.0) * log(N/nk) ] /
>         sqrt( sum(j=1, t) [ (log(fij) + 1.0) * log(N/nj) ]^2 )   (1)
>
> where fik is the occurrence frequency of Tk in Qi, N is the
> collection size, and nk is the number of documents with term Tk
> assigned. The factor log(N/nk) is an inverse collection frequency
> ("idf") factor which decreases as terms are used widely in a
> collection, and the denominator in expression (1) is used for weight
> normalization. This particular form will be called "ltc" weighting
> within this paper.
>
> The weights assigned to terms in documents are much the same. In
> practice, for both effectiveness and efficiency reasons the idf
> factor in the documents is dropped. [2, 1]"
>
> Similarly, another successful scoring formula that has been
> extensively tuned through empirical studies, OKAPI, uses idf
> linearly and not quadratically. I checked with some friends who are
> more expert in this area than I am: Edwin Cooper, the founder and
> Chief Scientist at InQuira, and his father Bill Cooper, a pioneer in
> IR and professor emeritus at Berkeley. Both of them report that
> squaring the idf term seems "strange" and is not consistent with the
> best known scoring formulas.
>
> At InQuira, we did extensive empirical tests for relevance in many
> different domains.
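The "ltc" weighting in formula (1) quoted above can be sketched numerically. This is a toy illustration, not code from the thread; all term frequencies, document frequencies, and the collection size are made up:

```python
import math

def ltc_weights(tf, df, N):
    # Salton's "ltc" weighting: (log(f) + 1) * log(N/n) per term,
    # cosine-normalized so the query weight vector has unit length.
    # tf: term -> occurrence frequency in the query (must be >= 1)
    # df: term -> number of documents containing the term
    # N:  collection size
    raw = {t: (math.log(f) + 1.0) * math.log(N / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# Toy two-term query over a made-up 100,000-document collection.
w = ltc_weights({"cheap": 2, "flights": 1},
                {"cheap": 5000, "flights": 500},
                100000)
print(w)
```

The normalization in the denominator of (1) is what makes the resulting query vectors comparable in length; the rarer term ("flights") ends up with the larger weight despite its lower frequency in the query.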
> Our situation was different, as the product retrieves specific
> passages, so our model was much more complex (extracting sentences
> or paragraphs from a large corpus that specifically answer a natural
> language question -- e.g., try a question like "what are roth vs
> regular iras" in the search box at www.bankofamerica.com). However,
> we did include document relevance factors, including tf*idf, and did
> not square the idf. None of our testing indicated that would have
> been an improvement, although we did not explicitly try it.
>
> ...
>
> Chuck
>
> > -----Original Message-----
> > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, October 13, 2004 9:25 AM
> > To: Lucene Developers List
> > Subject: Re: Contribution: better multi-field searching
> >
> > Paul Elschot wrote:
> > >> Did you see my IDF question at the bottom of the original note?
> > >> I'm really curious why the square of IDF is used for Term and
> > >> Phrase queries, rather than just IDF. It seems like it might be
> > >> a bug?
> > >
> > > I missed that.
> > > It has been discussed recently, but I don't remember the
> > > outcome, perhaps someone else does?
> >
> > This has indeed been discussed before.
> >
> > Lucene computes a dot-product of a query vector and each document
> > vector. Weights in both vectors are normalized tf*idf, i.e.,
> > (tf*idf)/length. The dot product of vectors d and q is:
> >
> >   score(d,q) = sum over t of ( weight(t,q) * weight(t,d) )
> >
> > Given this formulation, and the use of tf*idf weights, each
> > component of the sum has an idf^2 factor. That's just the way it
> > works with dot products of tf*idf/length vectors. It's not a bug.
> > If folks don't like it they can simply override Similarity.idf()
> > to return sqrt(super()).
> >
> > If someone can demonstrate that an alternate formulation produces
> > superior results for most applications, then we should of course
> > change the default implementation.
> > But just noting that there's a factor which is equal to idf^2 in
> > each element of the sum does not do this.
> >
> > Doug
> ...
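As a numeric check of Doug's dot-product formulation above, one term's contribution to score(d,q) factors into tf_q * tf_d * idf^2 / (q_len * d_len), and substituting sqrt(idf) into both vectors, which is the effect of the suggested Similarity.idf() override, makes the contribution linear in idf again. All counts and lengths below are made up for illustration, and plain log(N/n) is used for idf (Lucene's exact formula differs slightly, but the squaring effect is the same):

```python
import math

# Made-up numbers for a single term t in query q and document d.
N = 100000                 # collection size
n_t = 100                  # documents containing t
tf_q, tf_d = 1, 3          # term frequency of t in q and in d
q_len, d_len = 2.0, 10.0   # vector lengths used for normalization

idf = math.log(N / n_t)

# One component of the dot product of tf*idf/length vectors ...
contrib = (tf_q * idf / q_len) * (tf_d * idf / d_len)

# ... is the same quantity written with the idf^2 factor exposed.
factored = tf_q * tf_d * idf ** 2 / (q_len * d_len)

# With sqrt(idf) in both vectors, the contribution is linear in idf.
contrib_sqrt = (tf_q * math.sqrt(idf) / q_len) * (tf_d * math.sqrt(idf) / d_len)

print(contrib, factored, contrib_sqrt)
```

This is exactly the point of the thread: the idf^2 is not a separate factor in the code but an arithmetic consequence of putting idf into both the query and document weights.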