Proximity-enhanced boolean scoring (was: Re: Flexible index format / Payloads Cont'd)

Nadav Har'El Thu, 06 Jul 2006 03:15:49 -0700

On Wed, Jul 05, 2006, Paul Elschot wrote about "Re: Flexible index format / 
Payloads Cont'd":
> > Ok, then, I thought to myself - the normal queries and scorers only work
> > on the document level and don't use positions - but SpanQueries have
> > positions so I can create some sort of ProximityBooleanSpanQuery, right?
> > Well, unfortunately, I couldn't figure out how, yet. SpanScorer is a Scorer
>...
> 
> SpanQueries can be nested because they pass around
> Spans to higher levels for scoring at the top level of the proximity.


Ok, I've started writing a class I call ProximityBooleanQuery, which unlike
BooleanQuery will need to contain SpanQueries, not just any Query, as clauses.

My idea is that ProximityBooleanQuery's Scorer will sum the scores of the
individual clauses (just like in BooleanQuery), but further increase the
score depending on how many matches we find nearby in the same document (to figure
this out, I will use the Spans contained in the subdocuments).
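
To make the idea concrete, here is a rough sketch in plain Java. It is not
Lucene code and does not use Spans or Scorer: the class ProximityBoostSketch,
the ClauseMatch holder, and the maxDistance and bonusPerPair parameters are
all hypothetical names invented for illustration, and the "count nearby
pairs" bonus is just one simple way the boost could be defined.

```java
import java.util.List;

/** Hypothetical sketch, not part of Lucene: clause score sum plus a proximity bonus. */
final class ProximityBoostSketch {

    /** One clause's match in a single document: its score and its match positions. */
    static final class ClauseMatch {
        final float score;
        final int[] positions;

        ClauseMatch(float score, int[] positions) {
            this.score = score;
            this.positions = positions;
        }
    }

    /** Counts position pairs (one from each clause) at most maxDistance apart. */
    static int nearbyPairs(int[] a, int[] b, int maxDistance) {
        int count = 0;
        for (int pa : a) {
            for (int pb : b) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Sums the clause scores, then adds bonusPerPair for every nearby pair of matches. */
    static float score(List<ClauseMatch> clauses, int maxDistance, float bonusPerPair) {
        float sum = 0f;
        int pairs = 0;
        for (int i = 0; i < clauses.size(); i++) {
            sum += clauses.get(i).score;
            for (int j = i + 1; j < clauses.size(); j++) {
                pairs += nearbyPairs(clauses.get(i).positions,
                                     clauses.get(j).positions, maxDistance);
            }
        }
        return sum + bonusPerPair * pairs;
    }
}
```

For example, with two clauses scoring 1.0 and 2.0 that match at positions
{3, 40} and {5, 100}, maxDistance 5 and bonusPerPair 0.5, only the pair
(3, 5) is close enough, so the result is 3.0 + 0.5 = 3.5.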

One of the peculiar things I noticed while experimenting with this approach
is that SpanTermQuery's scorer is different from the regular TermQuery's
scorer - its scores always appear multiplied by 1/sqrt(2) compared to
TermQuery's scores. Is this deliberate? If not, should it perhaps be fixed?

> So a minimum form of "ProximityBooleanSpanQuery" is already there
> in Lucene. It is implemented by using a SpanScorer as a subscorer
> of a BooleanScorer2, and by having this SpanScorer use the proximity
> information passed up from the bottom level SpanTermQueries, normally
> via some other SpanQuery like SpanNearQuery.

I'm not sure I understand what you mean. Perhaps you mean something like
the simple solution I described in a previous mail, where I added to a
normal BooleanQuery several additional SpanNearQueries, one for each pair
of terms in the query. This solution works quite well, but I thought it
was inefficient, which is why I was looking to come up with a more basic
solution.
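
One way to see a possible cost of the pairwise approach is to count the extra
clauses: a query with n terms needs n(n-1)/2 of them. The sketch below is
plain Java with no Lucene classes; TermPairsSketch and pairs are hypothetical
names, and each returned pair simply stands in for one additional
SpanNearQuery clause.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not part of Lucene: enumerates the term pairs of a query. */
final class TermPairsSketch {

    /** Returns every unordered pair of terms at distinct indices, in index order. */
    static List<String[]> pairs(List<String> terms) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                // Each pair would become one extra proximity clause.
                result.add(new String[] { terms.get(i), terms.get(j) });
            }
        }
        return result;
    }
}
```

A four-term query yields 6 pairs and a ten-term query yields 45, so the
number of added clauses grows quadratically with the query length.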

> It might be possible to subclass Scorer to incorporate more position info,
> but SpanQueries have a slightly different take, they use Spans to pass 
> the position info around.
> This is also the reason why Lucene has some difficulty in weighting
> the subqueries of a SpanQuery: unlike a Scorer, a Spans does not have
> a score or weight value, and SpanScorer is used to provide the score, but
> only at the top level of the proximity structure.
> This could be changed by adding a weight to Spans, or by adding some
> form of position info to (a subclass of) Scorer.

Yes, I think you described the situation well. At this stage, I'll continue
to try to develop this feature using Lucene's existing Spans/SpanQuery
framework. I hope this is possible, because the ideas you raised (adding
weight to Spans or spans to Scorer) will require significant changes to
many of Lucene's existing query types, or duplication of these query
types, something which I'd rather avoid if possible.

-- 
Nadav Har'El                        |     Thursday, Jul 6 2006, 10 Tammuz 5766
IBM Haifa Research Lab              |-----------------------------------------
                                    |Cement mixer collided with a prison van.
http://nadav.harel.org.il           |Look out for sixteen hardened criminals.

