On Jun 30, 2006, at 6:07 AM, Nadav Har'El wrote:
On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible
index format / Payloads Cont'd":
* Improve IR precision, by writing a Boolean Scorer that
takes position into account, a la Brin/Page '98.
Yes, I'd love to see that too (and it doesn't even require any new
payload support; the positions that Lucene already has are enough).
True. Any intrepid volunteers jonesing to hack on BooleanScorer2?
Yeeha!
The reason I included this in my summary rather than separating it
out into something we could do earlier was locality of reference.
Right now, the boolean scorers scan through freqs for all terms, but
positions for only some terms. For common terms, which is where the
bulk of the cost lies in scoring, scanning through both freqs and
positions involves a number of disk seeks, as .frq and .prx are
consumed in 1k chunks. This is an area where OS caching is unlikely
to help too much, as we're talking about a lot of data.
A boolean scorer requiring that positions be read for *all* terms
will cost more. However, by merging the freq and prox files, those
disk seeks are eliminated, as all the freq/prox data for a term can
be slurped up in one contiguous read. That may serve to mitigate the
costs some.
However, simple term queries, at least those against fields where
positions are stored, will cost more -- because it will be necessary
to scan past irrelevant positional data. I think people who do a lot
of yes/no, unscored matches might be unhappy about that.
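
To make the trade-off in the last three paragraphs concrete, here is a
toy cost model in Python. It is only a sketch under stated assumptions,
not a measurement of Lucene: the function names are made up, and the
separate-file case pessimistically assumes that every 1k buffer refill
lands in a different file than the previous one and so costs a seek.

```python
import math

CHUNK = 1024  # buffer refill size in bytes, matching the "1k chunks" above


def cost_separate(freq_bytes, prox_bytes, need_positions, chunk=CHUNK):
    """Return (seeks, bytes_read) when freqs and positions live in two files."""
    if not need_positions:
        # Only the freq stream is touched: one seek, then sequential reads.
        return (1 if freq_bytes else 0), freq_bytes
    # Assumption: the two streams are consumed in lockstep, so each
    # refill of either buffer moves the disk head to the other file.
    seeks = math.ceil(freq_bytes / chunk) + math.ceil(prox_bytes / chunk)
    return seeks, freq_bytes + prox_bytes


def cost_merged(freq_bytes, prox_bytes, need_positions):
    """Return (seeks, bytes_read) when freq and prox data are interleaved."""
    total = freq_bytes + prox_bytes
    # need_positions is deliberately ignored: a reader that only wants
    # freqs still has to scan past the interleaved positional data.
    return (1 if total else 0), total


if __name__ == "__main__":
    # A common term with 2 KB of freq data and 4 KB of position data.
    print(cost_separate(2048, 4096, need_positions=True))
    print(cost_merged(2048, 4096, need_positions=True))
    print(cost_separate(2048, 4096, need_positions=False))
    print(cost_merged(2048, 4096, need_positions=False))
```

Under this model the merged layout wins on seeks whenever positions are
needed, and loses on bytes read whenever they are not, which is the
concern about plain term queries.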
Generally, I'm concerned about anyone who has fine-tuned their system
for search-time throughput. Adding additional search-time costs may
push some of these systems over the edge. As a total package, I
think the power of the changes easily justifies the price, and
furthermore, IR precision cannot be bought with more hardware, while
throughput can. But I suspect there will be some interested parties
who will disagree, and I'm sympathetic -- it would be a real bummer
if costly "improvements" to BooleanScorer2 made your app unworkable.
BooleanScorer3 anyone? Oi.
I tried a small test using the TREC 8 corpus and query-relevance
judgements, and saw a noticeable improvement in precision when I added
a simplistic version of this feature: I "or"ed the original query
words with SpanNearQuery clauses for each pair of words in the query,
so the query "hot dog bun" is converted to something similar to:
hot OR dog OR bun OR "hot dog"~7^0.25 "dog bun"~7^0.25 "hot bun"~7^0.25
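
The expansion above is mechanical enough to sketch in a few lines of
Python. This only builds a query string in the notation used above; it
does not construct SpanNearQuery objects, and the function name and
defaults (slop 7, boost 0.25, taken from the example) are assumptions
made for illustration.

```python
def expand_with_proximity(query, slop=7, boost=0.25):
    """OR the original terms with a sloppy phrase clause for every term pair."""
    terms = query.split()
    base = " OR ".join(terms)

    # Enumerate pairs by increasing distance in the original query, so
    # adjacent pairs ("hot dog", "dog bun") come before wider ones.
    pairs = []
    for gap in range(1, len(terms)):
        for i in range(len(terms) - gap):
            pairs.append((terms[i], terms[i + gap]))

    if not pairs:
        return base  # zero or one term: nothing to pair up

    phrases = " ".join(f'"{a} {b}"~{slop}^{boost}' for a, b in pairs)
    return f"{base} OR {phrases}"


if __name__ == "__main__":
    print(expand_with_proximity("hot dog bun"))
```

Note that the number of pair clauses grows quadratically with the
number of query terms, so a real implementation would probably want to
cap it for long queries.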
Nifty example!
One more note: Though payloads are not necessary for exploiting
positional data, associating a boost with each position opens the
door to an additional improvement in IR precision. The Googs, for
instance, describe dedicating 4-8 bits per posting to text size, so
that e.g. text between <h1> tags gets weighted more heavily than text
between <p> tags.
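
As a rough illustration of what a per-position boost could look like,
the Python sketch below packs a position and a 4-bit weight into one
integer. This is not Lucene's file format or anyone's actual encoding;
the bit width, the tag-to-weight table, and all names are hypothetical.

```python
WEIGHT_BITS = 4                        # assumed width; low end of "4-8 bits"
WEIGHT_MASK = (1 << WEIGHT_BITS) - 1   # 0b1111, i.e. weights 0..15

# Hypothetical weights: bigger text gets a bigger number.
TAG_WEIGHTS = {"h1": 15, "h2": 12, "p": 4}


def pack_posting(position, weight):
    """Store the weight in the low bits and the position in the high bits."""
    if position < 0:
        raise ValueError("position must be non-negative")
    if not 0 <= weight <= WEIGHT_MASK:
        raise ValueError("weight does not fit in WEIGHT_BITS")
    return (position << WEIGHT_BITS) | weight


def unpack_posting(packed):
    """Inverse of pack_posting: return (position, weight)."""
    return packed >> WEIGHT_BITS, packed & WEIGHT_MASK


def weighted_term_freq(packed_postings):
    """Sum per-occurrence weights, normalized so a max-weight hit counts as 1.0."""
    return sum(unpack_posting(p)[1] for p in packed_postings) / WEIGHT_MASK


if __name__ == "__main__":
    # One occurrence in an <h1> at position 3, one in a <p> at position 40.
    postings = [pack_posting(3, TAG_WEIGHTS["h1"]),
                pack_posting(40, TAG_WEIGHTS["p"])]
    print([unpack_posting(p) for p in postings])
    print(weighted_term_freq(postings))
```

A scorer could then use something like weighted_term_freq in place of
the raw term frequency, so a hit in a heading counts for more than a
hit in body text.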
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/