RE: idf^2

Chuck Williams Sun, 17 Oct 2004 14:14:10 -0700

Comments embedded:

> -----Original Message-----
> From: Paul Elschot [mailto:[EMAIL PROTECTED]
> Sent: Sunday, October 17, 2004 11:50 AM
> To: [EMAIL PROTECTED]
> Subject: Re: idf^2
> 
> Chuck,
> 
> On Sunday 17 October 2004 18:30, Chuck Williams wrote:
> > Paul,
> >
> > I'm glad to hear you are doing this -- it sounds like a great
project.
> > After sending the message below, I was thinking about the
possibility of
> > a flexible scoring architecture that would allow application
developers
> > to easily customize scoring to their needs.  One simple thing I'm
> > missing in the current Similarity mechanism is a parameter that
> > identifies the field for which the various factors are being
computed.
> > For various reasons, I'd like to use different computations for
> > different fields in my application.
> 
> For that you can use:
> 
> new TermQuery(new Term(fieldName, termText)) {
>   public Similarity getSimilarity(Searcher) {return yourSimilarity;}
> }
> 
[Chuck Williams]


Thanks for this tip.  As a newcomer to Lucene these things are not
always obvious.  I presume that one would need to do a similar thing
with PhraseQuery's and any other "primitive" queries if there are any.
It's great that Lucene has this flexibility, although in this case I
think that expressing a property of fields by overriding all of the
query classes that access those fields is not the ideal architecture.  A
more elegant approach would allow association of field-specific
information with the field (applying to all queries), and query-specific
information (like treating idf differently for the wildcard-like
queries) with the queries.

> >
> > Probabilistic scoring models generally use some kind of training or
> > learning mechanism to induce optimal coefficients and additive
factors
> > for combining a variety of score components (including tf, idf,
field
> > length and others -- e.g., OKAPI uses a term related to but
different
> > from idf:  the ratio of number of docs not containing a term to the
> > number of docs that do contain the term).  It would be great to have
a
> > scoring mechanism that defined an extensible library of standard
score
> > components and provided the flexibility to combine them arbitrarily,
and
> 
> Lucene comes a long way in allowing arbitrary combinations of
Similarity
> implementations.
> 
> > to use training to induce an optimal combining function.  The
> > probabilistic models have a couple significant advantages:
> >   1.  An application developer can optimize relevance by using
> > training/learning techniques.
> 
> If one has a relevant training set with a good set of real world
queries...
[Chuck Williams] 

Search logs are a great resource for this.


> 
> >   2.  Scores are true probabilities that reflect the likelihood a
result
> > is relevant.  This allows absolute comparability across searches and
> > various thresholding, segregating and other UI devices to
communicate
> > the quality of results to the user.
> 
> Lucene is already quite close to true probabilities: apart from the
> various
> weights, a TermQuery scores by the square root of the term frequence
in
> document divided by the square root of the document field length, ie.
> the score is proportional to the square root of the term  density.
[Chuck Williams] 

That's good, but it would be nice to finish the job.  It's kind of like
the problem with development that is 80% done, but 0% visible or useful.
The biggest problem I've seen in Lucene scoring is that the
normalization is not right, especially what is done in Hits.

> 
> > I would love to see a flexible scoring mechanism to go along with a
more
> 
> Lucene's Weight and Scorer also allow to override getSimilarity(),
> adding to its flexibility.
> 
> > structured query mechanism in Lucene, and would be interested in
helping
> 
> The current query parser already does quite a bit, however it does not
> allow truncation and distance to be combined (unless this was added
> recently.)
> 
> > to create this.  Within this mechanism, the default scoring formula
> > should be one of the best empirically tested and tuned variations,
e.g.,
> > Salton's final SMART formula or Robertson's OKAPI.
> 
> The difference between Lucene's scoring and Okapi is actually quite
little.
> Lucene as a coordination factor, which Okapi doesn't have.
> Okapi treats the document length somewhat differently, but in the end
> the net effect is mostly a saturation on term density, like the
> square root of Lucene. (Which looks much like an inverse power norm,
btw.)
> Lucene mostly lacks the possibility of synonyms, ie. adding the
densities
> before applying the saturation.
> 
> > Re. combining idf with structured query languages, I believe a
number of
> > people do this.  Let me describe what we did at InQuira briefly.  (I
use
> > the past tense here only because I've left the company.)  The user
query
> > language was either traditional keyword (with standard phrase, +/-,
etc.
> > operators) or a natural language question.  We used a front-end rule
> > system to rewrite the user query into a structured back-end query
> > language.  Those rules contain many heuristics, especially relative
to
> > the natural language queries (e.g., "when" translates into a date
> > search, "how much" into a quantity or currency search, etc.).  The
> > back-end query language has the normal Boolean, distance, etc.,
> > operations, along with some less common ones (e.g., we had an
ontology,
> > indexed concepts, and could do concept searches).  Parts of the
scoring
> > were fixed while other parts were flexible:  e.g., there were three
> > macro-components that could be weighted per-application or even
> > per-query.  The macro-components are excerpt-relevance (how likely a
> > sentence of paragraph is to precisely answer the question), document
> 
> Sentences and paragraphs are not directly supported in Lucene. Quite a
> while
> ago Doug mentioned some mechanism to extend an index with levels,
> which could probably be used for this, but I have not heard of it
anymore.
> OTOH one can easily define extra fields for them with different term
> positions
> than normal, so it's not too hard to implement now.
> 
> > relevance (tfidf is a major factor here, along with page rank,
etc.),
> > and recency.  Re. per-query weighting, consider that current events
> > oriented queries generally emphasize recency, topical queries
emphasize
> > document relevance and authority (tfidf and page rank), and specific
> > questions emphasize excerpt relevance.
> >
> > Through empirical testing we found that the optimal relationship
between
> > the macro-score components was one of tie-breaking.  E.g., tfidf was
a
> > good tie-breaker among results that appeared to have equally good
> > excerpts.
> 
> Would you have some background on what constitutes an excerpt here?
[Chuck Williams] 

InQuira parses a document into a tree of "scopes":  sentences,
paragraphs, sections, ...  An "answer" is the narrowest scope that
provides the information the user seeks (e.g., the answer to a natural
language question), combined with highlighting of the specific answer
components.  An "excerpt" is an answer, possibly with certain sentences
or phrases extracted if the scope is large, and possibly extended with
some additional context if the scope is small(e.g., the sentence before
and after the sentence that answers a question).  Hyperlinks on the
result list items go directly to the answer.

> 
> > I don't think Lucene should go in the question-answering direction,
but
> > do think it should support a range of useful scoring components that
can
> > be tuned and combined to yield great relevancy.  E.g., page rank is
an
> 
> The recently added term vectors can be used for standard relevance
> feedback
> mechanisms. I have not looked into these in much detail yet.
> Personally I'm a bit skeptic about parameters tuned for some training
set,
> mostly because they are not easy to explain to end users.
[Chuck Williams] 

I agree that "explaining" the relevance of results to end-users is
important in many domains, especially simple thresholding and
segregation, e.g., separating results that are likely to be relevant
from those that are likely to be irrelevant.  Again, fixing the
normalization in Lucene should achieve this straightforwardly.

> 
> > extremely valuable factor in web search, and in any domain where
this is
> > a rich hyper-link structure.  Having the flexibility in Lucene to
easily
> > incorporate this into scoring for domains where it is available and
> > useful would be great.
> 
> Nutch has implemented a page rank mechanism using Lucene.
[Chuck Williams] 

Cool.  I had guessed that Nutch was using Lucene, but hadn't seen that
explicitly stated before.

> 
> >
> > If you are interested in creating a flexible scoring architecture to
go
> > along with your new query mechanism and I could help, please let me
> > know.
> 
> I would be interested in implementing (substantial parts of) the
Inquery
> query language on top of  Lucene. Inquery is the best query
> language specification that is publicly available, AFAIK.
> However, if you know a better query language...
[Chuck Williams] 

I'm not deeply familiar with public query language specifications, but I
am in general a fan of U Mass CIIR and so would expect Inquery to be as
good as any.  I just reviewed their list of query operators and their
approach to beliefs, probabilities and inference networks.  As an old AI
guy, it looks pretty good to me.  I especially like their focus on the
information goal of the end-user; we used this same notion as the
central theme at InQuira (adding the concept of rules to rewrite the
user's surface-form query into a back-end query that looked for the
information that would satisfy the user's goal even if that was quite
different than content containing the words of his or her question).
Having specific belief-oriented combining operators (#max, #sum, #wsum
in Inquery) seems pretty useful.  Of course, the key here for almost all
domains is to do this for the user automatically.  End users don't
understand these things and shouldn't have to worry about it (which is
what the heuristic rule-based query-rewriter in InQuira is all about).

> 
> Earlier this year I posted some code on what I called the Surround
> query language on top of Lucene. It is a first attempt to combine
> proximity and truncation using the then newly introduced span queries:
>
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05504.html
> It also has explicit operators for everything, but no adaptations to
the
> scoring mechanism of Lucene.
> With some hindsight it's a nice proof of concept for a structured
query
> language on top of Lucene.
[Chuck Williams] 

Sounds cool.  I'll take a look through it...  I'm looking forward to
what you produce here.  Again, if I can be of any help, please let me
know.  Most of the attention in these systems seems to go to efficiency
considerations, which are of course essential.  However, at the end of
the day it's the ability of the system to find precisely the information
the end user wants and needs and put it right on top of the result list
that really matters.  I'm glad there is more thought and focus going
into this in Lucene.

Chuck

> 
> Regards,
> Paul Elschot
> 
> >
> > Chuck
> >
> > > -----Original Message-----
> > > From: Paul Elschot [mailto:[EMAIL PROTECTED]
> > > Sent: Sunday, October 17, 2004 3:15 AM
> > > To: Lucene Developers List
> > > Subject: Re: idf^2
> > >
> > > Chuck,
> > >
> > > I'm working on a more structured query language with
> > > the usual explicit boolean and distance operators.
> > >
> > > For the scoring I'm still in mostly the design stage.
> > > I'm considering to drop the idf altogether for several reasons.
> > >
> > > One reason is the issue of very low frequency terms (typo's in
full
> > > text of articles and pdf's for example) that get into queries via
> >
> > prefix
> >
> > > terms and fuzzy terms. This is also occasionaly reported  in the
IR
> > > literature
> > > but I don't have a reference handy.
> > >
> > > Another reason is that the users of a structured query language
> > > have their own ideas about the importance of terms. They take
> > > the effort to relate their terms with the operators, and the
> > > concept of idf doesn't seem to fit in there.
> > >
> > > Also, I'm not aware of research results on the combination of
> > > idf, structured query languages and full text search. I have the
> > > impression that idf is good for text like queries on abstracts,
> > > and that the document length and document term
> > > frequency are more important than idf for searching in full text.
> > >
> > > Finally, in my case, typically a classification based limitation
is
> > > used that reduces the searched documents to about 0.5% - 1.5%
> > > of the total number of documents available, leaving the idf less
> > > accurate.
> > >
> > > Any comments?
> > >
> > > ...
> > >
> > > >However, Lucene is a very good
> > > >search engine and it seems right that it would have a "best of
class"
> > > >scoring formula out of the box.
> > >
> > > I fully agree. It's also a very good environment for
> > > designing/implementing
> > > another query language, although that by itself turns out to be
> > > harder than I anticipated...
> > >
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > > On Sunday 17 October 2004 02:22, Chuck Williams wrote:
> > > > Doug Cutting wrote:
> > > > > If someone can demonstrate that an alternate formulation
produces
> > > > > superior results for most applications, then we should of
course
> > > >
> > > > change
> > > >
> > > > > the default implementation.  But just noting that there's a
factor
> > > >
> > > > which
> > > >
> > > > > is equal to idf^2 in each element of the sum does not do this.
> > > >
> > > > I researched the idf^2 issue further and believe that empirical
> >
> > studies
> >
> > > > have consistently concluded that one idf factor should be
dropped.
> > > > Salton, the originator of the IR vector space model, decided to
drop
> >
> > the
> >
> > > > idf term on documents in order to avoid the squaring.  I hear he
did
> > > > this after studying recall and precision for many variations of
his
> > > > formula.  Here is a quote from his Trec-3 paper, which
references
> >
> > the
> >
> > > > same comment in his Trec-1 and Trec-2 papers.  Note the final
> >
> > sentence:
> > > ...
> > >
> > > >    To allow a meaningful final retrieval similarity, it is
> >
> > convenient to
> >
> > > > use
> > > >    a length normalization factor as part of the term weighting
> >
> > formula.
> >
> > > > A
> > > >    high-quality term weighting formula for wik, the weight of
term
> >
> > Tk
> >
> > > >    in query Qi is
> > > >
> > > >      wik= [ log(fik) + 1.0) * log(N/nk) ] /
> > > >           sqrt(sum(j=1, t) [(log(fij) + 1.0) * log(N/nj)]^2)
> >
> > (1)
> >
> > > >    where fik is the occurrence frequency of Tk in Qi, N is the
> > > > collection
> > > >    size, and nk is the number of documents with term Tk
assigned.
> >
> > The
> >
> > > > factor
> > > >    log(N/nk) is an inverse collection frequency ("idf") factor
which
> > > >    decreases as terms are used widely in a collection, and the
> > > > denominator
> > > >    in expression (1) is used for weight normalization. This
> >
> > particular
> >
> > > > form
> > > >    will be called "ltc" weighting within this paper.
> > > >
> > > >    The weights assigned to terms in documents are much the same.
In
> > > > practice,
> > > >    for both effectiveness and efficiency reasons the idf factor
in
> >
> > the
> >
> > > >    documents is dropped.[2, 1]"
> > > >
> > > > Similarly, another successful scoring formula that has been
> >
> > extensively
> >
> > > > tuned through empirical studies, OKAPI, uses idf linearly and
not
> > > > quadratically.  I checked with some friends that are more expert
in
> >
> > this
> >
> > > > area than me, Edwin Cooper the founder and Chief Scientist at
> >
> > InQuira,
> >
> > > > and his father Bill Cooper, a pioneer in IR and professor
emeritus
> >
> > at
> >
> > > > Berkeley.  Both of them report that squaring the idf term seems
> > > > "strange" and is not consistent with the best known scoring
> >
> > formulas.
> >
> > > > At InQuira, we did extensive empirical tests for relevance in
many
> > > > different domains.  Our situation was different as the product
> >
> > retrieves
> >
> > > > specific passages so our model was much more complex (extracting
> > > > sentences or paragraphs from a large corpus that specifically
answer
> >
> > a
> >
> > > > natural language question -- e.g. try a question like "what are
roth
> >
> > vs
> >
> > > > regular iras" in the search box at www.bankofamerica.com).
However,
> >
> > we
> >
> > > > did include document relevance factors including tfidf and did
not
> > > > square the idf.  None of our testing indicated that would have
been
> >
> > an
> >
> > > > improvement, although we did not explicitly try it.
> > >
> > > ...
> > >
> > > > Chuck
> > > >
> > > > > -----Original Message-----
> > > > > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > > > > Sent: Wednesday, October 13, 2004 9:25 AM
> > > > > To: Lucene Developers List
> > > > > Subject: Re: Contribution: better multi-field searching
> > > > >
> > > > > Paul Elschot wrote:
> > > > > >>Did you see my IDF question at the bottom of the original
note?
> >
> > I'm
> >
> > > > > >>really curious why the square of IDF is used for Term and
Phrase
> > > > > >>queries, rather than just IDF.  It seems like it might be a
bug?
> > > > > >
> > > > > > I missed that.
> > > > > > It has been discussed recently, but I don't remember the
> >
> > outcome,
> >
> > > > > > perhaps some else?
> > > > >
> > > > > This has indeed been discussed before.
> > > > >
> > > > > Lucene computes a dot-product of a query vector and each
document
> > > > > vector.  Weights in both vectors are normalized tf*idf, i.e.,
> > > > > (tf*idf)/length.  The dot product of vectors d and q is:
> > > > >
> > > > >    score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) )
> > > > >
> > > > > Given this formulation, and the use of tf*idf weights, each
> >
> > component
> >
> > > > of
> > > >
> > > > > the sum has an idf^2 factor.  That's just the way it works
with
> >
> > dot
> >
> > > > > products of tf*idf/length vectors.  It's not a bug.  If folks
> >
> > don't
> >
> > > > like
> > > >
> > > > > it they can simply override Similarity.idf() to return
> >
> > sqrt(super()).
> >
> > > > > If someone can demonstrate that an alternate formulation
produces
> > > > > superior results for most applications, then we should of
course
> > > >
> > > > change
> > > >
> > > > > the default implementation.  But just noting that there's a
factor
> > > >
> > > > which
> > > >
> > > > > is equal to idf^2 in each element of the sum does not do this.
> > > > >
> > > > > Doug
> > >
> > > ...
> > >
> > >
> > >
---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail:
[EMAIL PROTECTED]
> >
> >
---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: idf^2

Reply via email to