On Jun 30, 2006, at 6:07 AM, Nadav Har'El wrote:

On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible index format / Payloads Cont'd":
  * Improve IR precision, by writing a Boolean Scorer that
    takes position into account, a la Brin/Page '98.

Yes, I'd love to see that too (and it doesn't even require any new payloads
support, the positions that Lucene already has are enough).

True. Any intrepid volunteers jonesing to hack on BooleanScorer2? Yeeha!

The reason I included this in my summary rather than separating it out into something we could do earlier was locality of reference.

Right now, the boolean scorers scan through freqs for all terms, but positions for only some terms. For common terms, which is where the bulk of the cost lies in scoring, scanning through both freqs and positions involves a number of disk seeks, as .frq and .prx are consumed in 1k chunks. This is an area where OS caching is unlikely to help too much, as we're talking about a lot of data.

A boolean scorer requiring that positions be read for *all* terms will cost more. However, by merging the freq and prox files, those disk seeks are eliminated, as all the freq/prox data for a term can be slurped up in one contiguous read. That may serve to mitigate the costs some.

However, simple term queries, at least those against fields where positions are stored, will cost more -- because it will be necessary to scan past irrelevant positional data. I think people who do a lot of yes/no, unscored matches might be unhappy about that.
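
To make the tradeoff concrete, here is a toy sketch of the two layouts. It is not Lucene's actual file format (which uses delta-encoded VInts and more); the function names and the flat integer lists standing in for .frq and .prx are assumptions made purely for illustration.

```python
# Toy model only: plain Python lists stand in for the on-disk streams.
# All names here are hypothetical and do not mirror Lucene's real format.

def encode_separate(postings):
    """postings: list of (doc_id, [positions]).
    Returns two streams, loosely analogous to .frq and .prx."""
    frq, prx = [], []
    for doc_id, positions in postings:
        frq += [doc_id, len(positions)]
        prx += positions
    return frq, prx

def encode_merged(postings):
    """One contiguous stream: doc_id, freq, then that doc's positions."""
    merged = []
    for doc_id, positions in postings:
        merged += [doc_id, len(positions)] + positions
    return merged

def doc_ids_from_merged(merged):
    """A term-only scan over the merged stream. It has to step over the
    positional entries it does not need; 'skipped' counts them."""
    doc_ids, skipped, i = [], 0, 0
    while i < len(merged):
        doc_id, freq = merged[i], merged[i + 1]
        doc_ids.append(doc_id)
        skipped += freq
        i += 2 + freq
    return doc_ids, skipped

postings = [(1, [3, 9]), (4, [0]), (7, [2, 5, 8])]
frq, prx = encode_separate(postings)
merged = encode_merged(postings)
doc_ids, skipped = doc_ids_from_merged(merged)
```

In the separate layout a term-only scan touches just the first stream; in the merged layout everything for a term sits in one contiguous run, but the term-only scan has to pass over every position on the way.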

Generally, I'm concerned about anyone who has fine-tuned their system for search-time throughput. Additional search-time costs may push some of these systems over the edge. As a total package, I think the power of the changes easily justifies the price, and furthermore, IR precision cannot be bought with more hardware, while throughput can. But I suspect there will be some interested parties who will disagree, and I'm sympathetic -- it would be a real bummer if costly "improvements" to BooleanScorer2 made your app unworkable.

BooleanScorer3 anyone?  Oi.

I tried a small test using the TREC 8 corpus and query-relevance judgements,
and saw a noticeable improvement in precision when I added a simplistic
version of this feature: I "or"ed the original query words with
SpanNearQuery's of each pair of words in the query, so the query of
"hot dog bun" will be converted to something similar to:

hot OR dog OR bun OR "hot dog"~7^0.25 "dog bun"~7^0.25 "hot bun"~7^0.25

Nifty example!
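
For anyone who wants to play with the idea, the rewrite can be sketched as a plain string transformation. This is only a guess at the shape of the expansion, not the code used in the TREC test: the function name is hypothetical, the slop of 7 and boost of 0.25 are taken from the example above, and the ORs are written out explicitly between every clause.

```python
from itertools import combinations

def expand_with_proximity(query, slop=7, boost=0.25):
    """Hypothetical helper: OR the original terms together with a
    boosted proximity clause for every unordered pair of terms."""
    terms = query.split()
    clauses = list(terms)
    for a, b in combinations(terms, 2):
        # "a b"~slop^boost, in the query syntax used in the example above
        clauses.append(f'"{a} {b}"~{slop}^{boost}')
    return " OR ".join(clauses)

expanded = expand_with_proximity("hot dog bun")
```

Note that the number of pair clauses grows quadratically with the number of query terms, so a long query would presumably want some cap on which pairs get added.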

One more note: Though payloads are not necessary for exploiting positional data, associating a boost with each position opens the door to an additional improvement in IR precision. The Googs, for instance, describe dedicating 4-8 bits per posting to text size, so that e.g. text between <h1> tags gets weighted more heavily than text between <p> tags.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

