On Apr 27, 2006, at 9:41 AM, Doug Cutting wrote:
> karl wettin wrote:
>> My own immediate thought is to compromise by allowing boost per
>> term in document. Simply remove the norms-methods from the
>> IndexReader and add a new one to the TermEnum and fall back on
>> the field boost. How would the value be picked up by the scorer?
>> Boost per position, etc. sounds very expensive.
>
> Indeed. It will probably nearly double the size of indexes and
> also increase search time.
I have been considering making a similar change to the KinoSearch
file format. Not having to cache norms radically cuts down on the
time required to launch a fresh Searcher, especially if there aren't
any deleted docs. That's a win if you're launching a search app from
scratch, like if you're running a web search under CGI rather than
mod_perl. It's also a win for refreshing a Searcher against a
frequently updated index.
What I was considering was interleaving each document's
score-multiplier norm byte between the VInts in the .frq file. That
would mean more disk I/O when a term's postings take up more than one
block on the file system, but at least the info would be contiguous.
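
To make the idea concrete, here is a rough Python sketch of a postings stream with the norm byte interleaved after each document's VInts. The record layout (doc delta, freq, norm byte) and the function names are assumptions made up for illustration; this is not the actual Lucene or KinoSearch .frq layout.

```python
def write_vint(buf, n):
    """Append n as a VInt: 7 data bits per byte, high bit set if more bytes follow."""
    while n > 0x7F:
        buf.append((n & 0x7F) | 0x80)
        n >>= 7
    buf.append(n)


def read_vint(data, pos):
    """Decode one VInt starting at data[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7


def encode_postings(postings):
    """Encode (doc_num, freq, norm_byte) tuples, sorted by doc_num, for one term."""
    buf = bytearray()
    last_doc = 0
    for doc_num, freq, norm in postings:
        write_vint(buf, doc_num - last_doc)  # doc number stored as a delta
        write_vint(buf, freq)                # term frequency within the doc
        buf.append(norm & 0xFF)              # hypothetical interleaved norm byte
        last_doc = doc_num
    return bytes(buf)


def decode_postings(data):
    """Inverse of encode_postings: recover the (doc_num, freq, norm_byte) tuples."""
    pos, doc_num, out = 0, 0, []
    while pos < len(data):
        delta, pos = read_vint(data, pos)
        freq, pos = read_vint(data, pos)
        norm = data[pos]
        pos += 1
        doc_num += delta
        out.append((doc_num, freq, norm))
    return out
```

In this sketch the scorer gets the norm in the same sequential read that yields the doc number and frequency, so nothing has to be cached up front. The cost is one extra byte per posting.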
I hadn't considered interleaving the score-multiplier into .prx, but
that opens many possibilities. Boost positions that appear near the
top of the doc. Boost positions if they occur within certain HTML
tags. Good stuff!
Moving away from cached norms was the second of three major changes
to the file format on my agenda, and the one I was all but certain I
wouldn't be able to sell to the Lucene community. The first was
using bytecounts at the head of Strings.
The third was storing start offsets and end offsets in the ProxFile.
It rankles that much of the information from tis/frq/prx gets
duplicated in the term vector files, but highlighting is most
efficient when you know the offsets, and the primary index stops
short of storing that information. Currently, we have this:

    ProxFile (.prx) --> <TermPositions>TermCount

How about this?

    ProxFile (.prx) --> <TermPositions,TermOffsets>TermCount

To get highlighting info now, you retrieve a document's term vector
information and then extract the offsets information for the specific
term. The proposed format reverses the order: first you find the term,
then you extract the offsets info for a particular doc.
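
A toy Python sketch of that term-first lookup follows. It uses an in-memory dict rather than a real .prx file, and the names (build_prox, highlight_term), the whitespace tokenizer, and the <b> markup are all assumptions for illustration only.

```python
from collections import defaultdict


def build_prox(docs):
    """Build term -> {doc_num: [(position, start_offset, end_offset), ...]}."""
    prox = defaultdict(dict)
    for doc_num, text in docs.items():
        cursor = 0
        for position, token in enumerate(text.split()):
            start = text.index(token, cursor)  # character offset of this token
            end = start + len(token)
            cursor = end
            prox[token.lower()].setdefault(doc_num, []).append((position, start, end))
    return prox


def highlight(text, spans, pre="<b>", post="</b>"):
    """Wrap each (start, end) character span of text in highlight markers."""
    out, last = [], 0
    for start, end in sorted(spans):
        out.append(text[last:start])
        out.append(pre + text[start:end] + post)
        last = end
    out.append(text[last:])
    return "".join(out)


def highlight_term(prox, docs, term, doc_num):
    """Find the term first, then pull the offsets for one particular doc."""
    entries = prox.get(term.lower(), {}).get(doc_num, [])
    spans = [(start, end) for _position, start, end in entries]
    return highlight(docs[doc_num], spans)
```

With the offsets stored next to the positions, highlighting a hit needs only the term's entry for that doc, and no separate term vector lookup.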
I haven't implemented this change yet, so I'm not sure how it works
out. The current version of KinoSearch stores term vectors in
the .fdt file, which is a win for locality of reference. It sure
would be nice to eliminate all that duplicated data, though.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/