Rich positions (was "boosting fields")

Marvin Humphrey Thu, 27 Apr 2006 11:58:52 -0700


On Apr 27, 2006, at 9:41 AM, Doug Cutting wrote:

karl wettin wrote:
My own immediate thought is to compromise by allowing boost per term in document. Simply remove the norms-methods from the IndexReader and add a new one to the TermEnum and fall back on the field boost. How would the value be picked up by the scorer?
Boost per position, etc. sounds very expensive.
Indeed. It will probably nearly double the size of indexes and also increase search time.

I have been considering making a similar change to the KinoSearch file format. Not having to cache norms radically cuts down on the time required to launch a fresh Searcher, especially if there aren't any deleted docs. That's a win if you're launching a search app from scratch, like if you're running a web search under CGI rather than mod_perl. It's also a win for refreshing a Searcher against a frequently updated index.

What I was considering was interleaving the document's score-multiplier norm byte between the VInts in the .frq file. That would mean more disk i/o for processing terms when the term takes up more than a block on the file system, but at least the info would be contiguous.
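To make the interleaving idea concrete, here is a minimal sketch in Python. It assumes a simplified posting layout of doc delta (VInt), freq (VInt), then one norm byte. The real .frq encoding differs in detail, and the names write_postings and read_postings are hypothetical, not part of Lucene or KinoSearch.

```python
import io


def write_vint(out, n):
    """Write a non-negative int, 7 bits per byte, high bit = 'more bytes follow'."""
    while n >= 0x80:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_vint(inp):
    """Read one variable-length int written by write_vint."""
    result = 0
    shift = 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7


def write_postings(out, postings):
    """postings: list of (doc_id, freq, norm_byte), sorted by doc_id.

    Assumed layout per document: <doc delta VInt> <freq VInt> <norm byte>.
    """
    last_doc = 0
    for doc_id, freq, norm in postings:
        write_vint(out, doc_id - last_doc)
        write_vint(out, freq)
        out.write(bytes([norm]))  # the interleaved score-multiplier byte
        last_doc = doc_id


def read_postings(inp, doc_count):
    """Read doc_count postings back; the norm arrives with each doc, no cache needed."""
    postings = []
    doc_id = 0
    for _ in range(doc_count):
        doc_id += read_vint(inp)
        freq = read_vint(inp)
        norm = inp.read(1)[0]
        postings.append((doc_id, freq, norm))
    return postings
```

The cost shows up directly in this layout: every posting grows by one byte, which is the extra disk i/o mentioned above, but a scorer reading the stream never has to consult a separately loaded norms array.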

I hadn't considered interleaving the score-multiplier into .prx, but that opens many possibilities. Boost positions that appear near the top of the doc. Boost positions if they occur within certain HTML tags. Good stuff!
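A rough sketch of what per-position boosts in .prx might look like, again in Python. The boost policy here (a threshold of 50 positions, the tag names, the multiplier values) is invented purely for illustration, and position_boost and write_rich_positions are hypothetical names.

```python
import io


def write_vint(out, n):
    """Write a non-negative int, 7 bits per byte, high bit = 'more bytes follow'."""
    while n >= 0x80:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def position_boost(position, tag=None):
    """Hypothetical policy: return a small integer multiplier that fits in one byte."""
    boost = 1
    if position < 50:            # assumed meaning of "near the top of the doc"
        boost += 1
    if tag in ("title", "h1"):   # assumed set of boosted HTML tags
        boost += 2
    return boost


def write_rich_positions(out, occurrences):
    """occurrences: list of (position, tag_or_None) in ascending position order.

    Assumed layout per occurrence: <position delta VInt> <boost byte>.
    """
    last = 0
    for position, tag in occurrences:
        write_vint(out, position - last)
        out.write(bytes([position_boost(position, tag)]))
        last = position
```

With this layout a term at position 0 inside a title gets a boost byte of 4, while the same term at position 400 in body text gets 1, so the scorer can weigh each occurrence rather than each document.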

Moving away from cached norms was the second of three major changes to the file format on my agenda, and the one I was all but certain I wouldn't be able to sell to the Lucene community. The first was using bytecounts at the head of Strings.
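For reference, a byte-count-prefixed string can be sketched as below. This is a simplified illustration assuming a VInt length followed by UTF-8 bytes; write_string, read_string and skip_string are hypothetical names, not an existing API.

```python
import io


def write_vint(out, n):
    """Write a non-negative int, 7 bits per byte, high bit = 'more bytes follow'."""
    while n >= 0x80:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_vint(inp):
    """Read one variable-length int written by write_vint."""
    result = 0
    shift = 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7


def write_string(out, s):
    """Prefix the string with its length in bytes, not in characters."""
    data = s.encode("utf-8")
    write_vint(out, len(data))
    out.write(data)


def read_string(inp):
    """Read exactly the announced number of bytes, then decode once."""
    return inp.read(read_vint(inp)).decode("utf-8")


def skip_string(inp):
    """Seek past a string without decoding any of its characters."""
    inp.seek(read_vint(inp), io.SEEK_CUR)
```

The point of counting bytes is visible in skip_string: the reader can jump over a string with a single seek, whereas a character count forces it to decode every character to find where the string ends.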

The third was storing start offsets and end offsets in the ProxFile. It rankles that much of the information from tis/frq/prx gets duplicated in the term vector files, but highlighting is most efficient when you know the offsets, and the primary index stops short of storing that information. Currently, we have this:


    ProxFile (.prx) -->  <TermPositions>TermCount

How about this?

    ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount

To get highlighting info now, you retrieve a document's term vector information and then extract the offsets information for the precise term. This format reverses the order: first you find the term, then you extract the offsets info for a particular doc.
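The reversed lookup order can be illustrated with a small in-memory sketch in Python. A nested dict stands in for the on-disk term dictionary and ProxFile, tokenization is a naive \w+ split, and index_document and highlight are hypothetical names used only for this example.

```python
import re


def index_document(index, doc_id, text):
    """Record (position, start_offset, end_offset) for every term occurrence.

    index maps term -> {doc_id -> [(position, start, end), ...]}, a stand-in
    for positions and offsets stored side by side in the primary index.
    """
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        entry = index.setdefault(term, {}).setdefault(doc_id, [])
        entry.append((position, match.start(), match.end()))


def highlight(index, docs, term, doc_id):
    """Find the term first, then pull the offsets for one particular doc."""
    text = docs[doc_id]
    pieces = []
    last = 0
    for _position, start, end in index.get(term, {}).get(doc_id, []):
        pieces.append(text[last:start])
        pieces.append("<b>" + text[start:end] + "</b>")
        last = end
    pieces.append(text[last:])
    return "".join(pieces)
```

No term vector is consulted here: the offsets needed for highlighting come straight out of the same structure that answers the query.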

I haven't implemented this change yet, so I'm not sure how it works out. The current version of KinoSearch stores term vectors in the .fdt file, which is a win for locality of reference. It sure would be nice to eliminate all that duplicated data, though.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

