Structured queries and flexible scoring. Was RE: idf^2

Chuck Williams Sun, 17 Oct 2004 10:44:18 -0700

Paul,

One more thing.  After writing all of that I didn't answer your direct
question!


I think that idf has proven to be valuable in almost every empirically
tested document relevance formula.  So rather than throwing it out,
perhaps the solution for typos showing up in queries with term wildcards
is to ignore idf only for terms generated by such wildcard query
components.  Or maybe to weight the entire component based on an average
of the idf's of the terms it generates (which works well if query idf's
are used but document idf's are not, as Salton has done).  Or something
like that.  A flexible scoring capability should allow these kinds of
query-specific variations as well.

Chuck

> -----Original Message-----
> From: Chuck Williams
> Sent: Sunday, October 17, 2004 9:31 AM
> To: 'Lucene Developers List'
> Subject: RE: idf^2
> 
> Paul,
> 
> I'm glad to hear you are doing this -- it sounds like a great project.
> After sending the message below, I was thinking about the possibility
of a
> flexible scoring architecture that would allow application developers
to
> easily customize scoring to their needs.  One simple thing I'm missing
in
> the current Similarity mechanism is a parameter that identifies the
field
> for which the various factors are being computed.  For various
reasons,
> I'd like to use different computations for different fields in my
> application.
> 
> Probabilistic scoring models generally use some kind of training or
> learning mechanism to induce optimal coefficients and additive factors
for
> combining a variety of score components (including tf, idf, field
length
> and others -- e.g., OKAPI uses a term related to but different from
idf:
> the ratio of number of docs not containing a term to the number of
docs
> that do contain the term).  It would be great to have a scoring
mechanism
> that defined an extensible library of standard score components and
> provided the flexibility to combine them arbitrarily, and to use
training
> to induce an optimal combining function.  The probabilistic models
have a
> couple significant advantages:
>   1.  An application developer can optimize relevance by using
> training/learning techniques.
>   2.  Scores are true probabilities that reflect the likelihood a
result
> is relevant.  This allows absolute comparability across searches and
> various thresholding, segregating and other UI devices to communicate
the
> quality of results to the user.
> 
> I would love to see a flexible scoring mechanism to go along with a
more
> structured query mechanism in Lucene, and would be interested in
helping
> to create this.  Within this mechanism, the default scoring formula
should
> be one of the best empirically tested and tuned variations, e.g.,
Salton's
> final SMART formula or Robertson's OKAPI.
> 
> Re. combining idf with structured query languages, I believe a number
of
> people do this.  Let me describe what we did at InQuira briefly.  (I
use
> the past tense here only because I've left the company.)  The user
query
> language was either traditional keyword (with standard phrase, +/-,
etc.
> operators) or a natural language question.  We used a front-end rule
> system to rewrite the user query into a structured back-end query
language.
> Those rules contain many heuristics, especially relative to the
natural
> language queries (e.g., "when" translates into a date search, "how
much"
> into a quantity or currency search, etc.).  The back-end query
language
> has the normal Boolean, distance, etc., operations, along with some
less
> common ones (e.g., we had an ontology, indexed concepts, and could do
> concept searches).  Parts of the scoring were fixed while other parts
were
> flexible:  e.g., there were three macro-components that could be
weighted
> per-application or even per-query.  The macro-components are excerpt-
> relevance (how likely a sentence of paragraph is to precisely answer
the
> question), document relevance (tfidf is a major factor here, along
with
> page rank, etc.), and recency.  Re. per-query weighting, consider that
> current events oriented queries generally emphasize recency, topical
> queries emphasize document relevance and authority (tfidf and page
rank),
> and specific questions emphasize excerpt relevance.
> 
> Through empirical testing we found that the optimal relationship
between
> the macro-score components was one of tie-breaking.  E.g., tfidf was a
> good tie-breaker among results that appeared to have equally good
excerpts.
> 
> I don't think Lucene should go in the question-answering direction,
but do
> think it should support a range of useful scoring components that can
be
> tuned and combined to yield great relevancy.  E.g., page rank is an
> extremely valuable factor in web search, and in any domain where this
is a
> rich hyper-link structure.  Having the flexibility in Lucene to easily
> incorporate this into scoring for domains where it is available and
useful
> would be great.
> 
> If you are interested in creating a flexible scoring architecture to
go
> along with your new query mechanism and I could help, please let me
know.
> 
> Chuck
> 
> > -----Original Message-----
> > From: Paul Elschot [mailto:[EMAIL PROTECTED]
> > Sent: Sunday, October 17, 2004 3:15 AM
> > To: Lucene Developers List
> > Subject: Re: idf^2
> >
> > Chuck,
> >
> > I'm working on a more structured query language with
> > the usual explicit boolean and distance operators.
> >
> > For the scoring I'm still in mostly the design stage.
> > I'm considering to drop the idf altogether for several reasons.
> >
> > One reason is the issue of very low frequency terms (typo's in full
> > text of articles and pdf's for example) that get into queries via
prefix
> > terms and fuzzy terms. This is also occasionaly reported  in the IR
> > literature
> > but I don't have a reference handy.
> >
> > Another reason is that the users of a structured query language
> > have their own ideas about the importance of terms. They take
> > the effort to relate their terms with the operators, and the
> > concept of idf doesn't seem to fit in there.
> >
> > Also, I'm not aware of research results on the combination of
> > idf, structured query languages and full text search. I have the
> > impression that idf is good for text like queries on abstracts,
> > and that the document length and document term
> > frequency are more important than idf for searching in full text.
> >
> > Finally, in my case, typically a classification based limitation is
> > used that reduces the searched documents to about 0.5% - 1.5%
> > of the total number of documents available, leaving the idf less
> > accurate.
> >
> > Any comments?
> >
> > ...
> > >However, Lucene is a very good
> > >search engine and it seems right that it would have a "best of
class"
> > >scoring formula out of the box.
> >
> > I fully agree. It's also a very good environment for
> > designing/implementing
> > another query language, although that by itself turns out to be
> > harder than I anticipated...
> >
> >
> > Regards,
> > Paul Elschot
> >
> >
> > On Sunday 17 October 2004 02:22, Chuck Williams wrote:
> > > Doug Cutting wrote:
> > > > If someone can demonstrate that an alternate formulation
produces
> > > > superior results for most applications, then we should of course
> > >
> > > change
> > >
> > > > the default implementation.  But just noting that there's a
factor
> > >
> > > which
> > >
> > > > is equal to idf^2 in each element of the sum does not do this.
> > >
> > > I researched the idf^2 issue further and believe that empirical
> studies
> > > have consistently concluded that one idf factor should be dropped.
> > > Salton, the originator of the IR vector space model, decided to
drop
> the
> > > idf term on documents in order to avoid the squaring.  I hear he
did
> > > this after studying recall and precision for many variations of
his
> > > formula.  Here is a quote from his Trec-3 paper, which references
the
> > > same comment in his Trec-1 and Trec-2 papers.  Note the final
> sentence:
> > >
> > ...
> > >    To allow a meaningful final retrieval similarity, it is
convenient
> to
> > > use
> > >    a length normalization factor as part of the term weighting
formula.
> > > A
> > >    high-quality term weighting formula for wik, the weight of term
Tk
> > >    in query Qi is
> > >
> > >      wik= [ log(fik) + 1.0) * log(N/nk) ] /
> > >           sqrt(sum(j=1, t) [(log(fij) + 1.0) * log(N/nj)]^2)
> (1)
> > >
> > >    where fik is the occurrence frequency of Tk in Qi, N is the
> > > collection
> > >    size, and nk is the number of documents with term Tk assigned.
The
> > > factor
> > >    log(N/nk) is an inverse collection frequency ("idf") factor
which
> > >    decreases as terms are used widely in a collection, and the
> > > denominator
> > >    in expression (1) is used for weight normalization. This
particular
> > > form
> > >    will be called "ltc" weighting within this paper.
> > >
> > >    The weights assigned to terms in documents are much the same.
In
> > > practice,
> > >    for both effectiveness and efficiency reasons the idf factor in
the
> > >    documents is dropped.[2, 1]"
> > >
> > > Similarly, another successful scoring formula that has been
> extensively
> > > tuned through empirical studies, OKAPI, uses idf linearly and not
> > > quadratically.  I checked with some friends that are more expert
in
> this
> > > area than me, Edwin Cooper the founder and Chief Scientist at
InQuira,
> > > and his father Bill Cooper, a pioneer in IR and professor emeritus
at
> > > Berkeley.  Both of them report that squaring the idf term seems
> > > "strange" and is not consistent with the best known scoring
formulas.
> > >
> > > At InQuira, we did extensive empirical tests for relevance in many
> > > different domains.  Our situation was different as the product
> retrieves
> > > specific passages so our model was much more complex (extracting
> > > sentences or paragraphs from a large corpus that specifically
answer a
> > > natural language question -- e.g. try a question like "what are
roth
> vs
> > > regular iras" in the search box at www.bankofamerica.com).
However,
> we
> > > did include document relevance factors including tfidf and did not
> > > square the idf.  None of our testing indicated that would have
been an
> > > improvement, although we did not explicitly try it.
> > >
> > ...
> > > Chuck
> > >
> > > > -----Original Message-----
> > > > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > > > Sent: Wednesday, October 13, 2004 9:25 AM
> > > > To: Lucene Developers List
> > > > Subject: Re: Contribution: better multi-field searching
> > > >
> > > > Paul Elschot wrote:
> > > > >>Did you see my IDF question at the bottom of the original
note?
> I'm
> > > > >>really curious why the square of IDF is used for Term and
Phrase
> > > > >>queries, rather than just IDF.  It seems like it might be a
bug?
> > > > >
> > > > > I missed that.
> > > > > It has been discussed recently, but I don't remember the
outcome,
> > > > > perhaps some else?
> > > >
> > > > This has indeed been discussed before.
> > > >
> > > > Lucene computes a dot-product of a query vector and each
document
> > > > vector.  Weights in both vectors are normalized tf*idf, i.e.,
> > > > (tf*idf)/length.  The dot product of vectors d and q is:
> > > >
> > > >    score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) )
> > > >
> > > > Given this formulation, and the use of tf*idf weights, each
> component
> > >
> > > of
> > >
> > > > the sum has an idf^2 factor.  That's just the way it works with
dot
> > > > products of tf*idf/length vectors.  It's not a bug.  If folks
don't
> > >
> > > like
> > >
> > > > it they can simply override Similarity.idf() to return
sqrt(super()).
> > > >
> > > > If someone can demonstrate that an alternate formulation
produces
> > > > superior results for most applications, then we should of course
> > >
> > > change
> > >
> > > > the default implementation.  But just noting that there's a
factor
> > >
> > > which
> > >
> > > > is equal to idf^2 in each element of the sum does not do this.
> > > >
> > > > Doug
> > > >
> > ...
> >
> >
> >
---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Structured queries and flexible scoring. Was RE: idf^2

Reply via email to