I'm digging deeper into the Lucene index format to develop some higher-level diagrams of its structure. One thing that puzzles me is that the term text is stored in the .tvf file. Why not point into the term dictionary by position and avoid duplicating the string, which could save substantial index size? I'm assuming this is for performance reasons.

Note that the Lucene index file formats documentation needs to be updated: TermText is no longer just a String, it is a <PrefixLength,Suffix> pair, similar to how terms in the .tis file are stored. I've updated fileformats.xml/.html accordingly. If I've gotten this wrong, let me know.
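
To make the <PrefixLength,Suffix> idea concrete, here is a minimal sketch of that style of encoding over a sorted term list. It is not Lucene's actual writer code: the class PrefixCodingSketch, its Entry type, and the encode/decode methods are names I made up for illustration, and it works on in-memory strings rather than on the real file layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Lucene.
class PrefixCodingSketch {

    /** One encoded term: length of the prefix shared with the previous term, plus the rest. */
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes sorted terms as (shared prefix length, suffix) pairs relative to the previous term. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int max = Math.min(prev.length(), term.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            prev = term;
        }
        return out;
    }

    /** Rebuilds the original terms by reusing the prefix of the previously decoded term. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String term = prev.substring(0, e.prefixLength) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "apply", "apricot");
        for (Entry e : encode(terms)) {
            System.out.println("<" + e.prefixLength + "," + e.suffix + ">");
        }
    }
}
```

The saving comes from sorted neighbours sharing long prefixes, so only the differing tail of each term is written. The cost is that a term can only be reconstructed by reading the entries before it in order.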

Just out of curiosity, are there any other known inconsistencies in the file formats documentation? I'd be happy to fix them if anything else is out of sync. I only spotted this one because I looked in the code to see how term vectors were written after noticing that the term text is duplicated.

        Erik

