RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Tue, 01 Feb 2005 11:40:39 -0800

I agree with all this if one is using Default-AND.

I prefer Default-OR, combined with scores that support the ability of
the application to know the quality of results returned (e.g., which
results match all terms, if any, and which don't, at the very least).
What you describe below as simultaneous consideration of coord and score
is along the lines of my score normalization proposal from a while back.
I think my score normalization proposal would yield the same benefits
within just the score mechanism.


I finally understand why things that seemed so obviously critical to me
weren't for you.  The fundamental difference is that I want users to see
results even if no document contains all query terms.  Then I want my UI
to communicate to users what they got.  I've found this approach to be
very effective, even if it is out of the current mainstream.

If the DensityQuery only supports Default-AND (which as currently
proposed is the case), then unfortunately I won't be able to use it and
will have to build my own mechanism.

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, February 01, 2005 11:05 AM
  > To: Lucene Developers List
  > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
  > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
  > problems with Similarity.docFreq() ?
  > 
  > Chuck Williams wrote:
  > >   > So I think this can be implemented using the expansion I proposed
  > >   > yesterday for MultiFieldQueryParser, plus something like my
  > >   > DensityPhraseQuery and perhaps a few Similarity tweaks.
  > >
  > > I don't think that works unless the mechanism is limited to default-AND
  > > (i.e., all clauses required).
  > 
  > Right.  I have repeatedly argued for default-AND.
  > 
  > > However, I don't see a way to integrate term proximity into that
  > > expansion.  Specifically, I don't see a way to handle proximity and
  > > coverage simultaneously without managing the multiple fields, field
  > > boosts and proximity considerations in a single query class.  Whence,
  > > the proposal for such a class.
  > 
  > To repeat my three-term, two-field example:
  > 
  > +(f1:t1^b1 f2:t1^b2)
  > +(f1:t2^b1 f2:t2^b2)
  > +(f1:t3^b1 f2:t3^b2)
  > f1:"t1 t2 t3"~s1^b3
  > f2:"t1 t2 t3"~s2^b4
  > 
  > Coverage is handled by the first three clauses.  Each term must match in
  > at least one field.  Proximity is boosted by the last two clauses: when
  > terms occur close together, the score is increased.  The implementation
  > of the ~ operator could be improved, as I proposed.
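
For reference, the expansion above generalizes mechanically to any number
of terms and fields.  Here is a minimal sketch that only builds the query
string.  The class, the FieldSpec holder and the expand() method are
hypothetical names of mine, not Lucene APIs, and the sketch assumes the
terms are already-analyzed single tokens with no query-syntax characters.

```java
import java.util.List;

// Sketch only: builds the default-AND expansion as a query string.
// Nothing here is a Lucene class; all names are illustrative.
final class QueryExpansionSketch {

    /** Hypothetical per-field settings (b1/b2, s1/s2, b3/b4 in the example). */
    static final class FieldSpec {
        final String name;
        final float termBoost;
        final int slop;
        final float phraseBoost;

        FieldSpec(String name, float termBoost, int slop, float phraseBoost) {
            this.name = name;
            this.termBoost = termBoost;
            this.slop = slop;
            this.phraseBoost = phraseBoost;
        }
    }

    static String expand(List<String> terms, List<FieldSpec> fields) {
        StringBuilder sb = new StringBuilder();

        // Coverage: one required clause per term; the term may match in any field.
        for (String term : terms) {
            sb.append("+(");
            for (int i = 0; i < fields.size(); i++) {
                FieldSpec f = fields.get(i);
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(f.name).append(':').append(term)
                  .append('^').append(f.termBoost);
            }
            sb.append(") ");
        }

        // Proximity: one optional sloppy-phrase clause per field.
        String phrase = String.join(" ", terms);
        for (FieldSpec f : fields) {
            sb.append(f.name).append(":\"").append(phrase).append("\"~")
              .append(f.slop).append('^').append(f.phraseBoost).append(' ');
        }
        return sb.toString().trim();
    }
}
```

The expansion has terms x fields primitive clauses plus one phrase clause
per field, and every coverage clause is required, so as written it covers
only the default-AND case.
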
  > 
  > > Do you see a way to do that?  I.e., do you see a scalable expansion that
  > > addresses all the issues for both default-or and default-and?
  > 
  > I am not really very interested in default-OR.  I think there are good
  > reasons that folks have gravitated towards default-AND.  I would prefer
  > we focus on a good default-AND solution for now.
  > 
  > If one wishes to rank things by coordination first, and then by score,
  > as an improved default-OR, then one needs more than just score-based
  > ranking.  Trying to concoct scores that alone guarantee such a ranking
  > is very fragile.  In general, one would need a HitCollector API that
  > takes both the coord and the score.  This is possible, but I'm not in a
  > hurry to implement it.
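
To make "coordination first, then score" concrete, here is a rough sketch
of the ordering such a collector would need.  It assumes a hypothetical
Hit record carrying the number of matched query clauses alongside the
score; as far as I know the existing HitCollector callback receives only
the document number and the score, so this is not an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: ranks hits by matched-clause count, then by score.
// Hit and rank() are illustrative names, not Lucene classes.
final class CoordThenScoreSketch {

    /** Hypothetical hit: doc id, number of query clauses matched, raw score. */
    static final class Hit {
        final int doc;
        final int matchedClauses;
        final float score;

        Hit(int doc, int matchedClauses, float score) {
            this.doc = doc;
            this.matchedClauses = matchedClauses;
            this.score = score;
        }
    }

    static final Comparator<Hit> COORD_THEN_SCORE = new Comparator<Hit>() {
        @Override
        public int compare(Hit a, Hit b) {
            // More matched clauses always wins, whatever the scores are.
            if (a.matchedClauses != b.matchedClauses) {
                return Integer.compare(b.matchedClauses, a.matchedClauses);
            }
            // Within the same coordination level, higher score first.
            int byScore = Float.compare(b.score, a.score);
            if (byScore != 0) {
                return byScore;
            }
            // Stable tie-break on document number.
            return Integer.compare(a.doc, b.doc);
        }
    };

    static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(COORD_THEN_SCORE);
        return sorted;
    }
}
```

With the matched-clause count available, the UI can also tell the user
which results match all terms and which match only some of them.
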
  > 
  > Lucene's development is constrained.  We want to improve Lucene, to
  > make search results better, to make it faster, and add needed features,
  > but we must at the same time keep it back-compatible, maintainable and
  > easy-to-use.  The smaller the code, the easier it is to maintain and
  > understand, so, e.g., a change that adds a lot of new code is harder to
  > accept than one that just tweaks existing code a bit.  We are changing
  > many APIs for Lucene 2.0, but we're also providing a clear migration
  > path for Lucene 1.X users.  When we add a new, improved API we must
  > deprecate the API it replaces and make sure that the new API supports
  > all the features of the old API.  We cannot afford to maintain multiple
  > implementations of similar functionality.  So, for these reasons, I am
  > not comfortable simply committing your DistributingMultiFieldQueryParser
  > and MaxDisjunctionQuery.  We need to fit these into Lucene, figure out
  > what they replace, etc.  Otherwise Lucene could just become a
  > hodge-podge of poorly maintained classes.  If we think these or
  > something like them do a better job, then we'd like it to be natural for
  > folks upgrading to start using them in favor of old methods, so that,
  > long term, we don't have to maintain both.  So the problem is not simply
  > figuring out what a better default ranking algorithm is, it is also
  > figuring out how to successfully integrate such an algorithm into Lucene.
  > 
  > > I think
  > > the query class I've proposed does that, and should be no more complex
  > > than the current SpanQuery mechanism, for example.
  > 
  > The SpanQuery mechanism is quite complex and permits matching of a
  > completely different sort: fragments rather than whole documents.  What
  > you're proposing does not seem so radically different that it cannot be
  > part of the normal document-matching mechanism.
  > 
  > > Also, it should be
  > > more efficient than a nested construction of more primitive components
  > > since it can be directly optimized.
  > 
  > It might use a bit less CPU, but would not reduce i/o.  My proposal
  > processes TermDocs twice, but since Lucene processes query terms in
  > parallel, and with filesystem caching, no extra i/o will be performed.
  > 
  > Doug

