On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels wrote:
> Seems doubtful you will be able to do this without increasing the
> index size dramatically. Since it will need to be stored
> "unpacked" (in order to have random access), yet the terms are
> variable length - leading to using a maximum=minimum size for every
> term.
Wow. That's a spectacularly awful design. Its worst case -- one outlier
term, say, 1000 characters in length, in a field where the average term length
is in the single digits -- would explode the index size and incur wasteful IO
overhead, just as you say.
Good thing we've never considered it. :)
I'm hoping we can improve on this, but for now, we've ended up at a two-file
design for the term dictionary index.
1) Stacked 64-bit file pointers.
2) Variable length character and term info data, interpreted using a
pluggable codec.
In the index at least, each entry would contain the full term text, encoded as
UTF-8. Probably the primary term dictionary would continue to use string
diffs.
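
To make the two-file layout concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the actual file format: the function names (build_index, entry, find) are hypothetical, the pointers are written big-endian, an extra trailing pointer marks the end of the last entry, and the term info and its codec are left out so that each entry holds just the UTF-8 term text.

```python
import struct

def build_index(terms):
    """Return (pointers, data) for the given terms, sorted by UTF-8 bytes.

    pointers: stacked 64-bit offsets into data, plus one trailing sentinel
              (the sentinel is an assumption of this sketch).
    data:     variable-length entries laid end to end, no padding.
    """
    pointers = bytearray()
    data = bytearray()
    for term in sorted(terms, key=lambda t: t.encode("utf-8")):
        pointers += struct.pack(">q", len(data))
        data += term.encode("utf-8")
    pointers += struct.pack(">q", len(data))  # end of the last entry
    return bytes(pointers), bytes(data)

def entry(pointers, data, i):
    """Random access to entry i: read two adjacent pointers, slice the data."""
    start, end = struct.unpack_from(">qq", pointers, i * 8)
    return data[start:end]

def find(pointers, data, term):
    """Binary search over the fixed-width pointers; return the slot or -1."""
    target = term.encode("utf-8")
    count = len(pointers) // 8 - 1  # number of entries, excluding the sentinel
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        if entry(pointers, data, mid) < target:
            lo = mid + 1
        else:
            hi = mid
    if lo < count and entry(pointers, data, lo) == target:
        return lo
    return -1
```

Because only the pointers are fixed width, a single 1000-character outlier costs its own 1000 bytes and nothing more. Both buffers are read-only and position-independent, so in principle they could be mmap'd files rather than in-memory byte strings.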
That design offers no significant benefits other than those that flow from
compatibility with mmap: faster IndexReader open/reopen, lower RAM usage
under multiple processes by way of buffer sharing. IO bandwidth requirements
and speed are probably a little better, but lookups on the term dictionary
index are not a significant search-time bottleneck.
Additionally, sort caches would be written at index time in three files, and
memory mapped as laid out in
<https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150>.
1) Stacked 64-bit file pointers.
2) Character data.
3) Doc num to ord mapping.
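
A rough Python sketch of how those three parts might fit together follows. Treat it as an assumption-laden illustration rather than the layout from the JIRA comment: the names (build_sort_cache, doc_ord, doc_value) are hypothetical, ords are stored as 32-bit big-endian ints, and the pointer array carries one extra trailing entry to mark the end of the character data.

```python
import struct

def build_sort_cache(doc_values):
    """Build the three parts from one field value per doc (index = doc num).

    Returns (pointers, chars, ords):
      pointers: stacked 64-bit offsets into chars, one per unique value,
                plus a trailing sentinel (an assumption of this sketch).
      chars:    the unique values as UTF-8, in sorted order.
      ords:     one 32-bit int per doc giving its value's rank.
    """
    uniq = sorted(set(doc_values), key=lambda v: v.encode("utf-8"))
    ord_of = {value: i for i, value in enumerate(uniq)}
    pointers = bytearray()
    chars = bytearray()
    for value in uniq:
        pointers += struct.pack(">q", len(chars))
        chars += value.encode("utf-8")
    pointers += struct.pack(">q", len(chars))
    ords = b"".join(struct.pack(">i", ord_of[v]) for v in doc_values)
    return bytes(pointers), bytes(chars), ords

def doc_ord(ords, doc_num):
    """Sort key for a doc: one fixed-width read, no string comparison."""
    return struct.unpack_from(">i", ords, doc_num * 4)[0]

def doc_value(pointers, chars, ords, doc_num):
    """Recover the actual field value for a doc via its ord."""
    start, end = struct.unpack_from(">qq", pointers, doc_ord(ords, doc_num) * 8)
    return chars[start:end].decode("utf-8")
```

Sorting hits then only touches the ords array, e.g. sorted(doc_nums, key=lambda d: doc_ord(ords, d)); the pointers and character data are needed only when the value itself has to be returned.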
Marvin Humphrey