Re: Flexible index format / Payloads Cont'd

Marvin Humphrey Fri, 30 Jun 2006 07:48:27 -0700


On Jun 30, 2006, at 6:07 AM, Nadav Har'El wrote:

On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible index format / Payloads Cont'd":
  * Improve IR precision, by writing a Boolean Scorer that
    takes position into account, a la Brin/Page '98.
Yes, I'd love to see that too (and it doesn't even require any new payloads
support, the positions that Lucene already has are enough).

True. Any intrepid volunteers jonesing to hack on BooleanScorer2? Yeeha!

The reason I included this in my summary rather than separating it out into something we could do earlier was locality of reference.

Right now, the boolean scorers scan through freqs for all terms, but positions for only some terms. For common terms, which is where the bulk of the cost lies in scoring, scanning through both freqs and positions involves a number of disk seeks, as .frq and .prx are consumed in 1k chunks. This is an area where OS caching is unlikely to help too much, as we're talking about a lot of data.

A boolean scorer requiring that positions be read for *all* terms will cost more. However, by merging the freq and prox files, those disk seeks are eliminated, as all the freq/prox data for a term can be slurped up in one contiguous read. That may serve to mitigate the costs some.

However, simple term queries, at least those against fields where positions are stored, will cost more -- because it will be necessary to scan past irrelevant positional data. I think people who do a lot of yes/no, unscored matches might be unhappy about that.
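
To make the trade-off concrete, here is a minimal sketch of the two layouts. It is a toy model, not Lucene's actual file format: flat Python lists stand in for the .frq and .prx streams, and the function names are made up for illustration. It only shows that an unscored match against the merged stream has to step past positional data that the separate layout never touches.

```python
# Toy model only: flat lists of ints stand in for on-disk streams.
# Each posting is (doc_id, [positions]); freq is len(positions).
# All names here are hypothetical, not Lucene APIs.

def encode_separate(postings):
    """Return (frq, prx): doc/freq pairs in one stream, positions in another."""
    frq, prx = [], []
    for doc_id, positions in postings:
        frq += [doc_id, len(positions)]
        prx += positions
    return frq, prx

def encode_merged(postings):
    """Return one stream: doc_id, freq, then that doc's positions."""
    merged = []
    for doc_id, positions in postings:
        merged += [doc_id, len(positions)]
        merged += positions
    return merged

def match_docs_separate(frq):
    """Unscored match: only the freq stream is traversed."""
    docs = [frq[i] for i in range(0, len(frq), 2)]
    return docs, len(frq)           # (doc ids, ints traversed)

def match_docs_merged(merged):
    """Unscored match on the merged stream: positions must be stepped over."""
    docs, i = [], 0
    while i < len(merged):
        doc_id, freq = merged[i], merged[i + 1]
        docs.append(doc_id)
        i += 2 + freq               # move past this doc's positional data
    return docs, len(merged)        # the whole stream is traversed

postings = [(1, [4, 9]), (5, [2]), (8, [1, 3, 7])]
frq, prx = encode_separate(postings)
merged = encode_merged(postings)

print(match_docs_separate(frq))
print(match_docs_merged(merged))
```

Both calls find the same documents, but the merged scan covers twice as many ints in this example. A scorer that needs positions for every term reads the same total either way, and with the merged layout it does so from a single contiguous stream.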

Generally, I'm concerned about anyone who has fine-tuned their system for search-time throughput. Adding additional search-time costs may push some of these systems over the edge. As a total package, I think the power of the changes easily justifies the price, and furthermore, IR precision cannot be bought with more hardware, while throughput can. But I suspect there will be some interested parties who will disagree, and I'm sympathetic -- it would be a real bummer if costly "improvements" to BooleanScorer2 made your app unworkable.


BooleanScorer3 anyone?  Oi.

I tried a small test using the Trec 8 corpus and query-relevance judgements,
and saw a noticeable improvement in precision when I added a simplistic
version of this feature: I "or"ed the original query words with
SpanNearQuery's of each pair of words in the query, so the query of
"hot dog bun" will be converted to something similar to:
hot OR dog OR bun OR "hot dog"~7^0.25 "dog bun"~7^0.25 "hot bun"~7^0.25


Nifty example!
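
For anyone who wants to try the same experiment, the expansion described above could be sketched roughly as follows. This only builds a query string in the informal notation used in the quoted message; the function name and the defaults (slop 7, boost 0.25, taken from that example) are assumptions, and it does not construct actual SpanNearQuery objects.

```python
from itertools import combinations

def expand_with_proximity(query, slop=7, boost=0.25):
    """OR the original terms with a boosted proximity clause per term pair.

    Hypothetical helper: the output is a string in the notation of the
    example above, not a parsed Lucene query.
    """
    terms = query.split()
    clauses = list(terms)
    for a, b in combinations(terms, 2):
        clauses.append(f'"{a} {b}"~{slop}^{boost}')
    return " OR ".join(clauses)

print(expand_with_proximity("hot dog bun"))
```

The pairs come out in the order produced by itertools.combinations, and every clause is joined with an explicit OR, so the string differs slightly in layout from the hand-written one. Note that the number of pair clauses grows quadratically with query length.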

One more note: Though payloads are not necessary for exploiting positional data, associating a boost with each position opens the door to an additional improvement in IR precision. The Googs, for instance, describe dedicating 4-8 bits per posting to text size, so that e.g. text between <h1> tags gets weighted more heavily than text between <p> tags.
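
As a rough illustration of a per-position boost, the sketch below packs a small boost value into the low bits of each position. The 4-bit width is just the low end of the range mentioned above, and the boost levels for <h1> and <p> text are invented for the example; none of this reflects an existing Lucene payload format.

```python
BOOST_BITS = 4                       # assumption: low end of the 4-8 bit range
BOOST_MASK = (1 << BOOST_BITS) - 1

# Hypothetical boost levels, chosen only for this example.
H1_BOOST = 8
P_BOOST = 1

def pack_posting(position, boost):
    """Store the boost in the low BOOST_BITS bits, the position above them."""
    if position < 0:
        raise ValueError("position must be non-negative")
    if not 0 <= boost <= BOOST_MASK:
        raise ValueError("boost does not fit in BOOST_BITS")
    return (position << BOOST_BITS) | boost

def unpack_posting(packed):
    """Return (position, boost) from a packed value."""
    return packed >> BOOST_BITS, packed & BOOST_MASK

def weighted_freq(packed_postings):
    """Sum per-position boosts instead of counting each occurrence as 1."""
    return sum(unpack_posting(p)[1] for p in packed_postings)

# One occurrence inside <h1> at position 3, one inside <p> at position 40.
doc_postings = [pack_posting(3, H1_BOOST), pack_posting(40, P_BOOST)]
print(unpack_posting(doc_postings[0]))
print(weighted_freq(doc_postings))
```

A scorer could feed something like weighted_freq into its term-frequency component in place of the raw count, so that a heading match outweighs a body-text match.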


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

