On Apr 27, 2006, at 9:41 AM, Doug Cutting wrote:

karl wettin wrote:
My own immediate thought is to compromise by allowing boost per term in document. Simply remove the norms-methods from the IndexReader and add a new one to the TermEnum and fall back on the field boost. How would the value be picked up by the scorer?
Boost per position, etc. sounds very expensive.

Indeed. It will probably nearly double the size of indexes and also increase search time.

I have been considering making a similar change to the KinoSearch file format. Not having to cache norms radically cuts down on the time required to launch a fresh Searcher, especially if there aren't any deleted docs. That's a win if you're launching a search app from scratch, like if you're running a web search under CGI rather than mod_perl. It's also a win for refreshing a Searcher against a frequently updated index.

What I was considering was interleaving the document's score-multiplier norm byte between the VInts in the .frq file. That would mean more disk I/O when a term's postings take up more than one file system block, but at least the info would be contiguous.
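
A minimal sketch of that interleaving, in Python for brevity. It assumes a simplified posting layout (doc delta as a VInt, freq as a VInt, then one norm byte); the actual .frq encoding packs freq differently, and the function names are made up for illustration:

```python
def write_vint(value, out):
    """Append a non-negative int as a VInt: 7 bits per byte, low-order
    bits first, high bit set on every byte except the last."""
    if value < 0:
        raise ValueError("VInts are non-negative")
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf, pos):
    """Decode one VInt starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_postings(postings):
    """postings: (doc_id, freq, norm_byte) tuples sorted by doc_id.
    Hypothetical layout: <DocDelta VInt, Freq VInt, Norm byte> per doc."""
    out = bytearray()
    last_doc = 0
    for doc_id, freq, norm in postings:
        write_vint(doc_id - last_doc, out)
        write_vint(freq, out)
        out.append(norm & 0xFF)  # the interleaved score-multiplier byte
        last_doc = doc_id
    return bytes(out)


def decode_postings(buf):
    """Inverse of encode_postings; the norm arrives with the posting,
    so nothing has to be cached up front."""
    pos = 0
    doc_id = 0
    postings = []
    while pos < len(buf):
        delta, pos = read_vint(buf, pos)
        freq, pos = read_vint(buf, pos)
        norm = buf[pos]
        pos += 1
        doc_id += delta
        postings.append((doc_id, freq, norm))
    return postings
```

The cost is one extra byte per posting, which is the trade against loading a norms array at Searcher startup.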

I hadn't considered interleaving the score-multiplier into .prx, but that opens many possibilities. Boost positions that appear near the top of the doc. Boost positions if they occur within certain HTML tags. Good stuff!

Moving away from cached norms was the second of three major changes to the file format on my agenda, and the one I was all but certain I wouldn't be able to sell to the Lucene community. The first was using byte counts at the head of Strings.
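
To illustrate why a byte count at the head of a String helps, here is a rough Python sketch. It assumes standard UTF-8 and a VInt length prefix; the helper names are hypothetical and not taken from either code base:

```python
def write_vint(value, out):
    """Append a non-negative int as a VInt (7 bits per byte, low first)."""
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf, pos):
    """Decode one VInt starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_string(text, out):
    """Write the length in BYTES (not characters), then the UTF-8 data."""
    data = text.encode("utf-8")
    write_vint(len(data), out)
    out.extend(data)


def read_string(buf, pos):
    """Read one length-prefixed string; return (text, next_pos)."""
    length, pos = read_vint(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def skip_string(buf, pos):
    """Jump past a string without decoding any of its characters.
    With a character count this would require scanning the data."""
    length, pos = read_vint(buf, pos)
    return pos + length
```

With a byte count, skipping a String is a single addition; with a character count, the reader has to walk the encoded characters to find where the String ends.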

The third was storing start offsets and end offsets in the ProxFile. It rankles that much of the information from tis/frq/prx gets duplicated in the term vector files, but highlighting is most efficient when you know the offsets, and the primary index stops short of storing that information. Currently, we have this:

    ProxFile (.prx) -->  <TermPositions>^TermCount

How about this?

    ProxFile (.prx) -->  <TermPositions,TermOffsets>^TermCount

To get highlighting info now, you retrieve a document's term vector information and then extract the offsets information for the precise term. This format reverses the order: first you find the term, then you extract the offsets info for a particular doc.
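
The reversed lookup order can be sketched with an in-memory stand-in for the proposed .prx data. This is only an illustration in Python: the nested dict, the `\w+` tokenizer, and the function names are assumptions, not part of either file format:

```python
import re
from collections import defaultdict


def build_prox(docs):
    """docs: {doc_id: text}.
    Returns term -> doc_id -> [(position, start_offset, end_offset)]."""
    prox = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            prox[term][doc_id].append((position, match.start(), match.end()))
    return prox


def highlight(prox, docs, term, doc_id, pre="<b>", post="</b>"):
    """Find the term first, then pull its offsets for one document
    and wrap each occurrence in pre/post markers."""
    text = docs[doc_id]
    pieces = []
    last = 0
    for _position, start, end in prox.get(term, {}).get(doc_id, []):
        pieces.append(text[last:start])
        pieces.append(pre + text[start:end] + post)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)
```

For example, with `docs = {1: "Norms are cached; norms cost memory."}`, calling `highlight(build_prox(docs), docs, "norms", 1)` marks both occurrences without touching any per-document term vector.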

I haven't implemented this change yet, so I'm not sure how it works out. The current version of KinoSearch stores term vectors in the .fdt file, which is a win for locality of reference. It sure would be nice to eliminate all that duplicated data, though.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/