On Apr 27, 2006, at 12:17 PM, Doug Cutting wrote:

Marvin Humphrey wrote:
Moving away from cached norms was the second of three major changes to the file format on my agenda, and the one I was all but certain I wouldn't be able to sell to the Lucene community. The first was using bytecounts at the head of Strings. The third was storing start offsets and end offsets in the ProxFile. It rankles that much of the information from tis/frq/prx gets duplicated in the term vector files, but highlighting is most efficient when you know the offsets, and the primary index stops short of storing that information. Currently, we have this:
    ProxFile (.prx) -->  <TermPositions>TermCount
How about this?
    ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount

This would at least double the size of the .prx file, the largest file in Lucene's index. Yes, it's useful, but not all folks will use it. So not all folks should have to pay for it.

Agreed. I think it would at least triple the ProxFile, actually. At any rate, I haven't thought of a compression scheme which could cram both start offset and end offset into fewer than two bytes on average. But in theory, it would eliminate the need for the Term Vectors files, so if you need those now only for highlighting, it's a big gain.
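
To make the size estimate concrete, here is a rough sketch of one way the proposed layout could be encoded. It is not Lucene's actual writer: the class name ProxSketch, the choice to store each start offset as a delta from the previous start, and the choice to store the end offset as a token length are all assumptions made for illustration. Only the VInt byte layout (seven data bits per byte, low-order group first, high bit as a continuation flag) follows Lucene's documented format.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: compares the byte cost of a
// positions-only entry with one that also carries start/end offsets.
final class ProxSketch {

    // Writes a non-negative int as a VInt: 7 data bits per byte,
    // low-order group first, high bit set on every byte but the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Current layout: one delta-encoded position per occurrence.
    static byte[] encodePositions(int[] positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPos = 0;
        for (int pos : positions) {
            writeVInt(out, pos - lastPos);
            lastPos = pos;
        }
        return out.toByteArray();
    }

    // Assumed layout for the proposal: position delta, start-offset delta
    // (relative to the previous start), then token length (end - start).
    static byte[] encodePositionsAndOffsets(int[] positions, int[] starts, int[] ends) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPos = 0;
        int lastStart = 0;
        for (int i = 0; i < positions.length; i++) {
            writeVInt(out, positions[i] - lastPos);
            writeVInt(out, starts[i] - lastStart);
            writeVInt(out, ends[i] - starts[i]);
            lastPos = positions[i];
            lastStart = starts[i];
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sample data: three occurrences of a five-character term.
        int[] positions = {3, 17, 42};
        int[] starts = {15, 98, 200};
        int[] ends = {20, 103, 205};
        System.out.println("positions only: " + encodePositions(positions).length + " bytes");
        System.out.println("with offsets:   "
                + encodePositionsAndOffsets(positions, starts, ends).length + " bytes");
    }
}
```

With this sample data every value fits in a single VInt byte, so the entry grows from 3 bytes to 9, which matches the "at least triple" estimate. Any start-offset delta of 128 or more spills into a second byte, so real documents would likely fare somewhat worse under this particular scheme.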

So perhaps instead we should simply allocate more bits in the FieldInfo. We could allocate bits for WEIGHT_PER_POSITION, OFFSETS_IN_PRX, NORMS_IN_FRQ, OMIT_PRX, OMIT_FREQ, etc. We can increase the number of bits there by turning this into a VInt, which would be back-compatible, no?
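
A minimal sketch of why the VInt switch can stay readable for old indexes, assuming the existing flags are stored in a single byte whose high bit is never set. The class name FieldBitsSketch and every bit value below are hypothetical and chosen only for illustration; they are not Lucene's real FieldInfo constants.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: per-field flags stored as a VInt.
final class FieldBitsSketch {
    // Illustrative bit assignments only.
    static final int IS_INDEXED          = 1 << 0;
    static final int STORE_TERMVECTOR    = 1 << 1;
    static final int WEIGHT_PER_POSITION = 1 << 5;
    static final int OFFSETS_IN_PRX      = 1 << 6;
    static final int NORMS_IN_FRQ        = 1 << 7;
    static final int OMIT_PRX            = 1 << 8;
    static final int OMIT_FREQ           = 1 << 9;

    // Encodes a non-negative int as a VInt (7 data bits per byte,
    // low-order group first, high bit marks "more bytes follow").
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes one VInt from the stream.
    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        // A flags byte as an old writer might have stored it (high bit clear).
        byte[] oldFormat = {(byte) (IS_INDEXED | STORE_TERMVECTOR)};
        System.out.println(readVInt(new ByteArrayInputStream(oldFormat)));

        // New flags above bit 6 push the value into a second byte.
        byte[] newFormat = writeVInt(IS_INDEXED | OMIT_PRX);
        System.out.println(newFormat.length + " bytes, decoded "
                + readVInt(new ByteArrayInputStream(newFormat)));
    }
}
```

Under that assumption, a new reader decodes an old single-byte value unchanged, because any value below 128 has the same encoding as a byte and as a VInt. The compatibility runs in one direction only: an old reader that expects exactly one byte would misparse a two-byte value written by a new writer.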

Using a VInt there sounds good 'n' clever to me. Supporting all those different configs is another question. "Flexibility is overrated." -- David Hansson.

It's charitable of you to include NORMS_IN_FRQ in that list, but in my mind, the idea was obsolesced the instant I saw WEIGHT_PER_POSITION. Both enable fast launch of a Searcher. That's the only benefit of NORMS_IN_FRQ, and it comes at the expense of increased file size in comparison to the same index without NORMS_IN_FRQ. WEIGHT_PER_POSITION has much more potential.

My primary goal with enabling fast launch is to make it so only the largest installations have to worry about running under mod_perl and caching Searchers. Simple installations, whether they are set up by a novice or by a sophisticated user who doesn't want to deal with mod_perl for a zillion possible reasons, should "just work" for as large an index as possible. The main concern out-of-the-gate is ease of use, and I'm happy to trade off increased file size to get it.

To turn the idea on its head... I'm inclined to make WEIGHT_PER_POSITION the default behavior, and either deep-six the cacheable norms files altogether or add them back as an "expert optimization": decreased file size and improved search speed at the expense of less control over scoring and slower startup.

Incidentally, how about calling it BOOST_PER_POSITION instead?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

