On Apr 27, 2006, at 12:17 PM, Doug Cutting wrote:
Marvin Humphrey wrote:
Moving away from cached norms was the second of three major
changes to the file format on my agenda, and the one I was all
but certain I wouldn't be able to sell to the Lucene community.
The first was using bytecounts at the head of Strings.
The third was storing start offsets and end offsets in the
ProxFile. It rankles that much of the information from tis/frq/
prx gets duplicated in the term vector files, but highlighting is
most efficient when you know the offsets, and the primary index
stops short of storing that information. Currently, we have this:
ProxFile (.prx) --> <TermPositions>^TermCount
How about this?
ProxFile (.prx) --> <TermPositions,TermOffsets>^TermCount
This would at least double the size of the .prx file, the largest
file in Lucene's index. Yes it's useful, but not all folks will
use it. So not all folks should have to pay for it.
Agreed. I think it would at least triple the ProxFile, actually. At
least, I haven't thought of a compression scheme which could cram
both start offset and end offset into fewer than two bytes on
average. But in theory, it would eliminate the need for the Term
Vectors files, so if you need those now only for highlighting, it's a
big gain.
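
To make the size estimate concrete, here is a rough sketch in Java that counts bytes for one term's occurrences in a single document. It assumes positions are written as VInt deltas, and it assumes a hypothetical offsets layout (start-offset delta followed by token length, both VInts). The class name, method names, and the layout itself are illustrative assumptions, not anything Lucene implements.

```java
// Sketch only: compares per-term byte counts with and without offsets.
// The offsets layout here is a hypothetical one chosen for illustration.
final class PrxSizeSketch {

    // Number of bytes a value occupies as a VInt (7 data bits per byte).
    static int vIntLength(int value) {
        int n = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            n++;
        }
        return n;
    }

    // Positions only: one VInt position delta per occurrence.
    static int positionsOnly(int[] positions) {
        int bytes = 0;
        int prev = 0;
        for (int p : positions) {
            bytes += vIntLength(p - prev);
            prev = p;
        }
        return bytes;
    }

    // Hypothetical layout: position delta, start-offset delta, token length.
    static int positionsWithOffsets(int[] positions, int[] starts, int[] ends) {
        int bytes = 0;
        int prevPos = 0;
        int prevStart = 0;
        for (int i = 0; i < positions.length; i++) {
            bytes += vIntLength(positions[i] - prevPos);
            bytes += vIntLength(starts[i] - prevStart);
            bytes += vIntLength(ends[i] - starts[i]);
            prevPos = positions[i];
            prevStart = starts[i];
        }
        return bytes;
    }
}
```

With positions {3, 10, 42}, start offsets {15, 60, 150} and end offsets {20, 66, 155}, positionsOnly gives 3 bytes and positionsWithOffsets gives 9. Every delta in that example fits in one byte, so tripling is the best case under this layout; wider gaps between occurrences push the start-offset delta to two bytes.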
So perhaps instead we should simply allocate more bits in the
FieldInfo. We could allocate bits for WEIGHT_PER_POSITION,
OFFSETS_IN_PRX, NORMS_IN_FRQ, OMIT_PRX, OMIT_FREQ, etc. We can
increase the number of bits there by turning this into a VInt,
which would be back-compatible, no?
Using a VInt there sounds good 'n' clever to me. Supporting all
those different configs is another question. "Flexibility is
overrated." -- David Hansson.
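
As a rough illustration of the back-compatibility point, the Java sketch below writes a set of field flags as a VInt. Any flag set below 128 encodes to the same single byte a plain byte field would hold, so a VInt reader sees old values unchanged, provided no existing file sets the high bit (0x80) of that byte. The bit assignments and the class name are hypothetical, chosen only for illustration; they are not Lucene's actual values.

```java
// Sketch only: the flag values are hypothetical, not Lucene's real assignments.
final class FieldFlagsSketch {
    static final int IS_INDEXED          = 1 << 0; // hypothetical
    static final int OMIT_PRX            = 1 << 5; // hypothetical
    static final int OFFSETS_IN_PRX      = 1 << 7; // hypothetical; needs a second byte
    static final int WEIGHT_PER_POSITION = 1 << 8; // hypothetical; needs a second byte

    // VInt: seven data bits per byte, low-order group first;
    // the high bit of each byte signals that another byte follows.
    static byte[] writeVInt(int value) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes a single VInt from the start of the array.
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return value;
    }
}
```

Here writeVInt(IS_INDEXED | OMIT_PRX) produces the single byte 0x21, exactly what a byte-sized flags field would contain, while writeVInt(IS_INDEXED | WEIGHT_PER_POSITION) spills into two bytes (0x81, 0x02) and still round-trips through readVInt.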
It's charitable of you to include NORMS_IN_FRQ in that list, but in
my mind, the idea was obsolesced the instant I saw
WEIGHT_PER_POSITION. Both enable fast launch of a Searcher. That's
the only benefit of NORMS_IN_FRQ, and it comes at the expense of
increased file size in comparison to the same index without
NORMS_IN_FRQ. WEIGHT_PER_POSITION has much more potential.
My primary goal with enabling fast launch is to make it so only the
largest installations have to worry about running under mod_perl and
caching Searchers. Simple installations, whether they are set up by
a novice or by a sophisticated user who doesn't want to deal with
mod_perl for a zillion possible reasons, should "just work" for as
large an index as possible. The main concern out-of-the-gate is ease
of use, and I'm happy to trade off increased file size to get it.
To turn the idea on its head... I'm inclined to make
WEIGHT_PER_POSITION the default behavior, and either deep-six the
cacheable norms files altogether or "add" them as an "expert
optimization": decreased file size and improved search speed at the
expense of less control over scoring and slower startup.
Incidentally, how about calling it BOOST_PER_POSITION instead?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/