I'm digging deeper into the Lucene index format to develop some higher-level diagrams of its structure. One thing I find curious is that the term text is stored in the .tvf file. Why not point to the term dictionary by position somehow and avoid duplicating this string, possibly saving substantial index size? I'm assuming this is for performance reasons.
The prefix compression helps some, but you're right, each term in a vector requires several bytes when it could optimally be represented as perhaps just one or two bytes on average if we numbered terms.
The problem is maintaining the numbering as the index grows and changes. Lucene indexes grow by merging segments. With term numbers, each segment would have a separate term numbering system. Terms would be renumbered as segments are merged. This is not hard to implement. When you merge the term dictionaries, keep an array per segment mapping its old term numbers to new term numbers in the merged index. Then use these arrays to upgrade the vectors to the new numbering as they're copied into the new segment index. So far so good. It requires 4 bytes of RAM per term in each segment being merged, since each array holds one integer per old term number. That makes optimizing large indexes much more memory intensive than it is currently, but not prohibitively so.
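To make the renumbering step concrete, here is a minimal sketch in Java. It is only an illustration of the idea described above, not code from Lucene: the class and method names (TermRenumbering, mergeDictionaries, buildMapping, remapVector) are hypothetical, term dictionaries are modelled as sorted in-memory lists of strings, and a vector is modelled as a plain array of term numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch; none of these names are Lucene APIs.
final class TermRenumbering {

    // Merged dictionary: the sorted union of every segment's terms.
    static List<String> mergeDictionaries(List<List<String>> segments) {
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        return new ArrayList<>(merged);
    }

    // oldToNew[i] is the merged-dictionary number of the segment's term i.
    // Assumes both lists are sorted the same way, contain no duplicates,
    // and that every segment term appears in mergedTerms.
    static int[] buildMapping(List<String> segmentTerms, List<String> mergedTerms) {
        int[] oldToNew = new int[segmentTerms.size()];
        int j = 0;
        for (int i = 0; i < segmentTerms.size(); i++) {
            // Both lists are sorted, so j only ever moves forward.
            while (!mergedTerms.get(j).equals(segmentTerms.get(i))) {
                j++;
            }
            oldToNew[i] = j;
        }
        return oldToNew;
    }

    // Rewrites a vector of old term numbers into the merged numbering.
    static int[] remapVector(int[] vector, int[] oldToNew) {
        int[] remapped = new int[vector.length];
        for (int k = 0; k < vector.length; k++) {
            remapped[k] = oldToNew[vector[k]];
        }
        return remapped;
    }
}
```

For example, merging a segment with terms apple, cat, dog and a segment with terms bird, cat, zebra gives the merged dictionary apple, bird, cat, dog, zebra, with mappings [0, 2, 3] and [1, 2, 4]. A vector [0, 2] from the second segment (bird, zebra) becomes [1, 4] after the merge. Each mapping array holds one int per term in its segment.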
But what happens when you have an unoptimized index and you want to compare vectors from two different segments? There's no way to do this without looking up all of the terms in each segment's term dictionary. This requires a random disk access per vector term and would hence be prohibitively slow. MultiSearcher would have the same problem.
So term-number-based vectors would be small and fast to use if all you're using is a single, optimized index, but very slow to use with unoptimized indexes and multiple indexes. That seems like a bad situation, so, unless someone figures out another way, we're stuck with the current approach. Vectors are bigger and slower than optimal, but they're consistently so.
Note, the Lucene index file formats documentation needs to be updated - TermText is no longer just a String, it is a <PrefixLength,Suffix> similar to how terms in the .tis are stored. I've updated fileformats.xml/.html - if I've gotten this wrong, let me know.
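For anyone diagramming this, a small Java sketch of the <PrefixLength,Suffix> idea may help. It is an illustration only, not the actual .tvf or .tis reader/writer: the PrefixCoding and Entry names are hypothetical, the prefix length is assumed to be counted in Java chars, and the on-disk byte encoding is ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of prefix coding over a sorted term list.
final class PrefixCoding {

    // One encoded term: chars shared with the previous term, then the rest.
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    // Encodes each term relative to the term immediately before it.
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> entries = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int limit = Math.min(previous.length(), term.length());
            int shared = 0;
            while (shared < limit && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            entries.add(new Entry(shared, term.substring(shared)));
            previous = term;
        }
        return entries;
    }

    // Rebuilds the full terms by reusing the prefix of the previous term.
    static List<String> decode(List<Entry> entries) {
        List<String> terms = new ArrayList<>();
        String previous = "";
        for (Entry entry : entries) {
            String term = previous.substring(0, entry.prefixLength) + entry.suffix;
            terms.add(term);
            previous = term;
        }
        return terms;
    }
}
```

With the terms bit, bite, biter, cat this produces <0,"bit">, <3,"e">, <4,"r">, <0,"cat">. The saving depends on the terms being sorted, since only neighbouring terms share prefixes.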
Looks good to me. Thanks for catching this!
Just out of curiosity - are there any other known inconsistencies with the file formats documentation?
Good question. Let me think...
The segments file has also changed format, and this is not yet reflected in the file format documentation.
The skip data description is new. The text is clumsy, but I think it is mostly accurate. One mistake is that TIFormat is now -2, not -1. Other than that, it looks right to me.
We should probably also make clear somewhere what has changed. We promise to do so at the top of the file, but don't. So perhaps sections that have changed should get "since 1.4" or "changed in 1.4" notices or some such. This will make life much easier for ports.
Doug