On Dec 22, 2006, at 10:36 AM, Doug Cutting wrote:
> The easiest way to do this would be to have separate files in each segment for each PostingFormat. It would be better if different posting formats could share files, but that's harder to coordinate.
The approach I'm taking in KinoSearch 0.20 is for each field to get its own postings file, named _XXX.pYYY, where "_XXX" is the segment name and "YYY" is the field number. That allows a single decoder to be pointed at each file. _XXX.frq and _XXX.prx have been eliminated.
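To make the naming scheme concrete, here is a minimal sketch in Java. The class and method names are hypothetical, and writing the field number in plain decimal is an assumption; all that is fixed above is that "_XXX" is the segment name and "YYY" is the field number.

```java
// Hypothetical helper for illustration; not part of KinoSearch or Lucene.
final class PostingsFileNames {
    private PostingsFileNames() {}

    // Builds "<segment>.p<fieldNumber>", so segment "_3" and field 7 give "_3.p7".
    // Assumption: the field number is rendered in decimal.
    static String forField(String segmentName, int fieldNumber) {
        if (fieldNumber < 0) {
            throw new IllegalArgumentException("field number must be non-negative");
        }
        return segmentName + ".p" + fieldNumber;
    }
}
```

With one file per field, a reader can open the file for a field and hand it to whichever single decoder that field is configured with.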
One file per format would also work.
> Alternately we could force all postings into a single file per segment. That would simplify the APIs, but prohibit certain file formats, like the one Lucene uses currently.
In theory, we could also have one file per property: doc num, freq, positions, boost, payload. The base Posting object would have only a document number, and each subclass would add a new property and a new file.
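A rough sketch of what one file per property might look like as a class hierarchy, in Java. Everything here is hypothetical: the class names, the fileExtensions() method, and the extensions "doc", "frq", and "pos" are invented for illustration and do not correspond to any existing format.

```java
import java.util.ArrayList;
import java.util.List;

// Base posting: only a document number, stored in its own file.
class Posting {
    int doc;

    // Extensions of the per-property files this posting type needs (hypothetical).
    List<String> fileExtensions() {
        List<String> exts = new ArrayList<>();
        exts.add("doc");
        return exts;
    }
}

// Adds one property (frequency) and one file.
class FreqPosting extends Posting {
    int freq;

    @Override
    List<String> fileExtensions() {
        List<String> exts = super.fileExtensions();
        exts.add("frq");
        return exts;
    }
}

// Adds one more property (positions) and one more file.
class PositionPosting extends FreqPosting {
    int[] positions;

    @Override
    List<String> fileExtensions() {
        List<String> exts = super.fileExtensions();
        exts.add("pos");
        return exts;
    }
}
```

Each level of the hierarchy reads exactly one more file than its parent, so a consumer that needs only document numbers never has to touch the others.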
I'm not sure that's better, as it precludes optimizations such as the even/odd trick currently used in _XXX.frq, but it merits mention as the conceptual opposite of having one file per format.
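For reference, the even/odd trick as the Lucene file-format documentation describes it: each entry stores the document delta shifted left by one bit, with the low bit set when the frequency is exactly one; otherwise the frequency follows as a separate value. The sketch below, in Java, produces only the sequence of integers, before any VInt byte encoding. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative encoder for the doc-delta/freq interleaving; not Lucene code.
final class FrqEncoding {
    private FrqEncoding() {}

    // docs must be ascending; freqs[i] is the term frequency in docs[i].
    static List<Integer> encode(int[] docs, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (freqs[i] == 1) {
                // Odd value: frequency is one, nothing else is written.
                out.add((delta << 1) | 1);
            } else {
                // Even value: the frequency follows as its own entry.
                out.add(delta << 1);
                out.add(freqs[i]);
            }
        }
        return out;
    }
}
```

A term that occurs once in document 7 and three times in document 11 encodes as 15, 8, 3. The saving depends on doc num and freq sharing one stream, which is exactly what a file-per-property layout would rule out.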
Matchers would be happy with that scheme no matter what.
> So the ideal solution would permit both different formats to either share a file, or to use their own file(s). Is it worth the complexity this would add to the API? Or should we jettison the notion of multiple posting files per segment?
Does punting on this issue have any drawbacks other than an unknown performance impact? Can we design the API so that we leave open the option of allowing the user to spec multiple files if that proves advantageous later?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/