On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels wrote:

> Seems doubtful you will be able to do this without increasing the
> index size dramatically. Since it will need to be stored
> "unpacked" (in order to have random access), yet the terms are
> variable length - leading to using a maximum=minimum size for every
> term.

Wow. That's a spectacularly awful design. Its worst case -- one outlier
term, say, 1000 characters in length, in a field where the average term
length is in the single digits -- would explode the index size and incur
wasteful IO overhead, just as you say.

Good thing we've never considered it. :)

I'm hoping we can improve on this, but for now, we've ended up at a
two-file design for the term dictionary index.

  1) Stacked 64-bit file pointers.
  2) Variable length character and term info data, interpreted using a
     pluggable codec.

In the index at least, each entry would contain the full term text,
encoded as UTF-8. Probably the primary term dictionary would continue to
use string diffs.

That design offers no significant benefits other than those that flow
from compatibility with mmap: faster IndexReader open/reopen, and lower
RAM usage under multiple processes by way of buffer sharing. IO bandwidth
requirements and speed are probably a little better, but lookups on the
term dictionary index are not a significant search-time bottleneck.

Additionally, sort caches would be written at index time in three files,
and memory mapped as laid out in
<https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150>.

  1) Stacked 64-bit file pointers.
  2) Character data.
  3) Doc num to ord mapping.

Marvin Humphrey
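
For illustration only, here is a minimal Python sketch of the two-file
layout for the term dictionary index described above: one buffer of
fixed-width 64-bit pointers, and one buffer of variable-length UTF-8 term
text. It is not code from Lucene or any related project. The function
names, the big-endian byte order, and the 4-byte length prefix are all
assumptions made for the example, and the term info data and pluggable
codec are left out.

```python
import struct

POINTER_SIZE = 8  # one 64-bit file pointer per index entry


def build_index(terms):
    """Return (pointers, data) buffers for the given terms.

    Terms are ordered by their UTF-8 bytes. The 4-byte length prefix in
    the data buffer is a hypothetical encoding chosen for this sketch.
    """
    pointers = bytearray()
    data = bytearray()
    for encoded in sorted(t.encode("utf-8") for t in terms):
        pointers += struct.pack(">Q", len(data))  # offset of this entry in data
        data += struct.pack(">I", len(encoded))   # assumed length prefix
        data += encoded                           # full term text, not a string diff
    return bytes(pointers), bytes(data)


def term_at(pointers, data, i):
    """Random access: fetch the i-th term via its fixed-width pointer."""
    (offset,) = struct.unpack_from(">Q", pointers, i * POINTER_SIZE)
    (length,) = struct.unpack_from(">I", data, offset)
    start = offset + 4
    return data[start:start + length]


def lookup(pointers, data, term):
    """Binary search over the entries; return the entry number or -1."""
    target = term.encode("utf-8")
    lo, hi = 0, len(pointers) // POINTER_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        candidate = term_at(pointers, data, mid)
        if candidate < target:
            lo = mid + 1
        elif candidate > target:
            hi = mid
        else:
            return mid
    return -1
```

Because only the pointers are fixed width, a single 1000-character
outlier term costs its own length plus one 8-byte pointer rather than
padding every entry to the maximum. The same slicing and
`struct.unpack_from` calls would work on memory-mapped files in place of
the in-memory `bytes` objects used here.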