On Fri, Jun 30, 2006, Marvin Humphrey wrote about "Re: Flexible index format / Payloads Cont'd":
> >On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible
> >index format / Payloads Cont'd":
> >> * Improve IR precision, by writing a Boolean Scorer that
> >>   takes position into account, a la Brin/Page '98.
> >
> >Yes, I'd love to see that too (and it doesn't even require any new
> >payloads support, the positions that Lucene already has are enough).
>
> True. Any intrepid volunteers jonesing to hack on BooleanScorer2?
> Yeeha!
I felt somewhat intrepid, so I decided to try to do this myself.
Unfortunately, it turns out to be much more complicated than I thought.

The problem is that Scorer, and its implementations - BooleanScorer2,
DisjunctionSumScorer and ConjunctionScorer - only work on the document
level. Scorer has next() and skipTo(), but no way to view positions
inside the document. If you look at the lowest-level Scorer, TermScorer,
it uses TermDocs and not TermPositions. So I couldn't figure out a way to
hack on BooleanScorer2 to change the score by positions.

Ok, then, I thought to myself - the normal queries and scorers only work
on the document level and don't use positions - but SpanQueries have
positions, so I can create some sort of ProximityBooleanSpanQuery, right?
Well, unfortunately, I couldn't figure out how, yet. SpanScorer is a
Scorer as usual, and still doesn't have access to the positions. It does
keep "spans", and gives a score according to their lengths, but I couldn't
figure out how to use this facility to do what we want.

Lastly, I looked at LoosePhraseScorer, to understand how phrases do get to
use positions. It appears that this scorer gets initialized with the
TermPositions of each term, which include the positions. This is great,
but it means that a phrase can only contain terms (words) -
LoosePhraseScorer cannot handle more complex sub-queries with their own
Scorers. It would have been nice if the proximity-enhanced boolean query
could support not just term sub-queries.

> Right now, the boolean scorers scan through freqs for all terms, but
> positions for only some terms. For common terms, which is where the
> bulk of the cost lies in scoring, scanning through both freqs and
> positions involves a number of disk seeks, as .frq and .prx are
> consumed in 1k chunks. This is an area where OS caching is unlikely
> to help too much, as we're talking about a lot of data.

I'm not sure I follow. You say the boolean scorers currently scan
positions for some terms; I didn't see this happening. Or do you mean the
case where one of the clauses is, say, a phrase, in which case the
sub-scorer is the one that scans positions?

> A boolean scorer requiring that positions be read for *all* terms
> will cost more. However, by merging the freq and prox files, those
> disk seeks are eliminated, as all the freq/prox data for a term can
> be slurped up in one contiguous read. That may serve to mitigate the
> costs some.

You are absolutely right.

> However, simple term queries, at least those against fields where
> positions are stored, will cost more -- because it will be necessary
> to scan past irrelevant positional data. I think people who do a lot
> of yes/no, unscored matches might be unhappy about that.

I think that just like we can say for a certain field that it shouldn't
have norms, it should be possible to say about a certain field that it
doesn't have positions. Consider a case where you know in advance that a
field's value will only be used to filter results, e.g., the field's
value is a list of categories the document belongs to. You never intend
to use this field in a phrase search or even to score matches, so you
simply don't need to store positions. Such a simple posting list,
containing just the list of documents without any intra-document
positions or payloads, is sometimes known as a "filter posting list" or
"binary posting list".
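To illustrate with today's API (the "category" field name and the helper
below are just made-up examples): norms can already be dropped per field
at indexing time, and a "no positions" flag could be requested the same
way, e.g.:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document makeFilterOnlyDoc(String category) {
        Document doc = new Document();
        // Filter-only field: indexed as a single term, norms disabled --
        // this much Lucene can already do today.
        doc.add(new Field("category", category,
                          Field.Store.NO, Field.Index.NO_NORMS));
        // A hypothetical analogue (does NOT exist yet) could look like
        // Field.Index.NO_POSITIONS, telling the writer to skip the .prx
        // data for this field entirely.
        return doc;
    }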
This per-field flag, "no positions", should be part of the general
redesign of the index structure that will also allow for per-document and
per-position payloads.

> One more note: Though payloads are not necessary for exploiting
> positional data, associating a boost with each position opens the
> door to an additional improvement in IR precision. The Googs, for
> instance, describe dedicating 4-8 bits per posting to text size, so
> that e.g. text between <h1> tags gets weighted more heavily than text
> between <p> tags.

Indeed. If you want a "poor man's version" of their capability before
per-position payloads are added to Lucene, you can try this simple trick:
double every word inside the <h1>. This gives those words a boost
compared to the other words. Of course, it's easy to double words, but
you can't do this with fractions like 1.5 :-)

--
Nadav Har'El                  | Wednesday, Jul 5 2006, 9 Tammuz 5766
IBM Haifa Research Lab        |-----------------------------------------
                              |Why do we drive on a parkway and park on
http://nadav.harel.org.il     |a driveway?
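P.S. A rough index-time sketch of the doubling trick above. The field
name and the bodyText/h1Text variables are illustrative only - they would
come from whatever HTML parsing you already do:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document makeDoc(String bodyText, String h1Text) {
        Document doc = new Document();
        doc.add(new Field("content", bodyText,
                          Field.Store.NO, Field.Index.TOKENIZED));
        // Add the <h1> text a second time to the same field: its terms
        // now occur twice as often, so their tf (and hence their weight)
        // roughly doubles. Integer multiples only - no 1.5x this way.
        doc.add(new Field("content", h1Text,
                          Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }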