On Fri, Jun 30, 2006, Marvin Humphrey wrote about "Re: Flexible index format / Payloads Cont'd":
> >On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible
> >index format / Payloads Cont'd":
> >> * Improve IR precision, by writing a Boolean Scorer that
> >>   takes position into account, a la Brin/Page '98.
> >
> >Yes, I'd love to see that too (and it doesn't even require any new
> >payloads support, the positions that Lucene already has are enough).
>
> True. Any intrepid volunteers jonesing to hack on BooleanScorer2?
> Yeeha!
I felt somewhat intrepid, so I decided to try to do this myself.
Unfortunately, it turns out to be much more complicated than I thought.

The problem is that Scorer, and its implementations - BooleanScorer2,
DisjunctionSumScorer and ConjunctionScorer - only work on the document
level. Scorer has next() and skipTo(), but no way to view positions
inside the document. If you look at the lowest-level Scorer, TermScorer,
it uses TermDocs and not TermPositions. So I couldn't figure out a way to
hack on BooleanScorer2 to change the score by positions.

Ok, then, I thought to myself - the normal queries and scorers only work
on the document level and don't use positions - but SpanQueries have
positions, so I can create some sort of ProximityBooleanSpanQuery, right?
Well, unfortunately, I couldn't figure out how, yet. SpanScorer is a
Scorer as usual, and still doesn't have access to the positions. It does
keep "spans", and gives a score according to their lengths, but I couldn't
figure out how to use this facility to do what we want.

Lastly, I looked at LoosePhraseScorer, to understand how phrases do get to
use positions. It appears that this scorer gets initialized with the
TermPositions of each term, which include the positions. This is great,
but it means that a phrase can only contain terms (words) -
LoosePhraseScorer cannot handle more complex sub-queries with their own
Scorers. It would have been nice if the proximity-enhanced boolean query
could support not just term sub-queries.

> Right now, the boolean scorers scan through freqs for all terms, but
> positions for only some terms. For common terms, which is where the
> bulk of the cost lies in scoring, scanning through both freqs and
> positions involves a number of disk seeks, as .frq and .prx are
> consumed in 1k chunks. This is an area where OS caching is unlikely
> to help too much, as we're talking about a lot of data.

I'm not sure I follow. You say the boolean scorers currently scan
positions for some terms; I didn't see this happening. Or do you mean the
case where one of the clauses is, say, a phrase, in which case the
sub-scorer is the one that scans positions?

> A boolean scorer requiring that positions be read for *all* terms
> will cost more. However, by merging the freq and prox files, those
> disk seeks are eliminated, as all the freq/prox data for a term can
> be slurped up in one contiguous read. That may serve to mitigate the
> costs some.

You are absolutely right.

> However, simple term queries, at least those against fields where
> positions are stored, will cost more -- because it will be necessary
> to scan past irrelevant positional data. I think people who do a lot
> of yes/no, unscored matches might be unhappy about that.

I think that just like we can say for a certain field that it shouldn't
have norms, it should be possible to say about a certain field that it
doesn't have positions. Consider a case where you know in advance that a
field's value will only be used to filter results, e.g., the field's
value is a list of categories the document belongs to. You never intend
to use this field in a phrase search or even to score matches, so you
simply don't need to store positions. Such a simple posting list,
containing just the list of documents without any intra-document
positions or payloads, is sometimes known as a "filter posting list" or
"binary posting list".
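To illustrate with today's API (the "category" field name and the helper
below are just made-up examples): norms can already be dropped per field
at indexing time, and a "no positions" flag could be requested the same
way, e.g.:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document makeFilterOnlyDoc(String category) {
        Document doc = new Document();
        // Filter-only field: indexed as a single term, norms disabled --
        // this much Lucene can already do today.
        doc.add(new Field("category", category,
                          Field.Store.NO, Field.Index.NO_NORMS));
        // A hypothetical analogue (does NOT exist yet) could look like
        // Field.Index.NO_POSITIONS, telling the writer to skip the .prx
        // data for this field entirely.
        return doc;
    }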
This per-field flag, "no positions", should be part of the general
redesign of the index structure that will also allow for per-document and
per-position payloads.

> One more note: Though payloads are not necessary for exploiting
> positional data, associating a boost with each position opens the
> door to an additional improvement in IR precision. The Googs, for
> instance, describe dedicating 4-8 bits per posting to text size, so
> that e.g. text between <h1> tags gets weighted more heavily than text
> between <p> tags.

Indeed. If you want a "poor man's version" of their capability before
per-position payloads are added to Lucene, you can try this simple trick:
double every word inside the <h1>. This gives those words a boost
compared to the other words. Of course, it's easy to double words, but
you can't do this with fractions like 1.5 :-)

--
Nadav Har'El                  | Wednesday, Jul 5 2006, 9 Tammuz 5766
IBM Haifa Research Lab        |-----------------------------------------
                              |Why do we drive on a parkway and park on
http://nadav.harel.org.il     |a driveway?
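P.S. A rough index-time sketch of the doubling trick above. The field
name and the bodyText/h1Text variables are illustrative only - they would
come from whatever HTML parsing you already do:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document makeDoc(String bodyText, String h1Text) {
        Document doc = new Document();
        doc.add(new Field("content", bodyText,
                          Field.Store.NO, Field.Index.TOKENIZED));
        // Add the <h1> text a second time to the same field: its terms
        // now occur twice as often, so their tf (and hence their weight)
        // roughly doubles. Integer multiples only - no 1.5x this way.
        doc.add(new Field("content", h1Text,
                          Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }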