Re: Realtime Search

Marvin Humphrey Wed, 24 Dec 2008 11:03:31 -0800

On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels wrote:
> Seems doubtful you will be able to do this without increasing the  
> index size dramatically. Since it will need to be stored  
> "unpacked" (in order to have random access), yet the terms are  
> variable length - leading to using a maximum=minimum size for every  
> term.

Wow. That's a spectacularly awful design. Its worst case -- one outlier
term, say, 1000 characters in length, in a field where the average term length
is in the single digits -- would explode the index size and incur wasteful IO
overhead, just as you say.

Good thing we've never considered it. :)

I'm hoping we can improve on this, but for now, we've ended up at a two-file
design for the term dictionary index.

1) Stacked 64-bit file pointers.
2) Variable-length character and term info data, interpreted using a
pluggable codec.

In the index at least, each entry would contain the full term text, encoded as
UTF-8. Probably the primary term dictionary would continue to use string
diffs.
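
A minimal sketch of that two-file layout, to make the random-access argument
concrete. It is an illustration under assumptions, not the actual
implementation: the names `write_term_index` and `TermIndex` are hypothetical,
pointers are little-endian signed 64-bit offsets, one extra sentinel pointer
marks the end of the last entry, and the term info data and pluggable codec
are left out entirely.

```python
import mmap
import struct


def write_term_index(terms, ptr_path, dat_path):
    """Write distinct terms, sorted by their UTF-8 bytes, as two files:
    a stack of 64-bit offsets and the variable-length character data."""
    blobs = sorted(t.encode("utf-8") for t in terms)
    with open(ptr_path, "wb") as ptrs, open(dat_path, "wb") as data:
        offset = 0
        for blob in blobs:
            ptrs.write(struct.pack("<q", offset))  # where this entry starts
            data.write(blob)                       # full term text, no padding
            offset += len(blob)
        ptrs.write(struct.pack("<q", offset))      # sentinel: end of last entry


class TermIndex:
    """Read-only view over both files via mmap (assumes a non-empty index)."""

    def __init__(self, ptr_path, dat_path):
        self._pf = open(ptr_path, "rb")
        self._df = open(dat_path, "rb")
        self._ptrs = mmap.mmap(self._pf.fileno(), 0, access=mmap.ACCESS_READ)
        self._data = mmap.mmap(self._df.fileno(), 0, access=mmap.ACCESS_READ)
        self.size = len(self._ptrs) // 8 - 1       # minus the sentinel

    def term_bytes(self, i):
        # Two adjacent pointers bound entry i in the data file.
        start, end = struct.unpack("<qq", self._ptrs[8 * i:8 * i + 16])
        return self._data[start:end]

    def find(self, term):
        """Binary search; returns the entry number, or -1 if absent."""
        target = term.encode("utf-8")
        lo, hi = 0, self.size
        while lo < hi:
            mid = (lo + hi) // 2
            if self.term_bytes(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        if lo < self.size and self.term_bytes(lo) == target:
            return lo
        return -1

    def close(self):
        self._ptrs.close()
        self._data.close()
        self._pf.close()
        self._df.close()
```

In this sketch each entry costs a fixed 8 bytes in the pointer file plus
exactly its own length in the data file, so a single 1000-character term
adds about 1000 bytes rather than widening every entry.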

That design offers no significant benefits other than those that flow from
compatibility with mmap: faster IndexReader open/reopen and lower RAM usage
under multiple processes by way of buffer sharing. IO bandwidth requirements
and speed are probably a little better, but lookups on the term dictionary
index are not a significant search-time bottleneck.

Additionally, sort caches would be written at index time in three files, and
memory mapped as laid out in
<https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150>.

1) Stacked 64-bit file pointers.
2) Character data.
3) Doc num to ord mapping.
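
One plausible reading of those three files, sketched with in-memory byte
strings standing in for the memory-mapped files. The linked JIRA comment is
the authoritative description; everything here is an assumption made for
illustration: the function names are hypothetical, every document has exactly
one value, ords are assigned in UTF-8 byte order of the unique values,
pointers are little-endian 64-bit offsets with a trailing sentinel, and ords
are 32-bit integers.

```python
import struct


def build_sort_cache(doc_values):
    """doc_values[n] is the sort-field value of document n.
    Returns (pointers, chars, ords) as bytes, one per file."""
    blobs = sorted(set(v.encode("utf-8") for v in doc_values))
    ord_of = {blob: i for i, blob in enumerate(blobs)}

    # File 1: stacked 64-bit pointers into the character data.
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    pointers = struct.pack("<%dq" % len(offsets), *offsets)

    # File 2: the unique values, concatenated in sorted order.
    chars = b"".join(blobs)

    # File 3: one ord per document, indexed by doc num.
    doc_ords = [ord_of[v.encode("utf-8")] for v in doc_values]
    ords = struct.pack("<%di" % len(doc_ords), *doc_ords)
    return pointers, chars, ords


def doc_ord(ords, doc_num):
    """Sort position of a document's value: a single fixed-width read."""
    return struct.unpack_from("<i", ords, 4 * doc_num)[0]


def doc_value(pointers, chars, ords, doc_num):
    """Recover the actual value, needed only when it must be displayed
    or compared across caches."""
    o = doc_ord(ords, doc_num)
    start, end = struct.unpack_from("<qq", pointers, 8 * o)
    return chars[start:end].decode("utf-8")
```

Under these assumptions, sorting hits within one cache only touches the ord
file, for example `sorted(hits, key=lambda d: doc_ord(ords, d))`; the pointer
and character files are consulted only to turn an ord back into its value.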

Marvin Humphrey

