It seems doubtful you will be able to do this without increasing the
index size dramatically. The index will need to be stored "unpacked"
(in order to have random access), yet the terms are variable length,
which forces every entry to be padded out to the maximum term length
(minimum slot size = maximum slot size).
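To make the size cost concrete, here is a minimal sketch (a
hypothetical layout, not Lucene's actual file format) of what an
"unpacked", randomly accessible term index implies: every slot is
padded to the maximum term length so that entry i sits at a fixed
offset.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FixedWidthTermIndex {
    static final int MAX_TERM_BYTES = 64;            // every term padded to this
    static final int SLOT_SIZE = MAX_TERM_BYTES + 8; // term bytes + file pointer

    // Entry i starts at a fixed offset, which is what makes random
    // access (and binary search) possible without decoding the
    // preceding entries.
    static int slotOffset(int i) {
        return i * SLOT_SIZE;
    }

    static void writeSlot(ByteBuffer buf, int i, String term, long filePointer) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        buf.position(slotOffset(i));
        buf.put(bytes);
        // pad the rest of the slot - this padding is the size cost
        for (int pad = bytes.length; pad < MAX_TERM_BYTES; pad++) {
            buf.put((byte) 0);
        }
        buf.putLong(filePointer);
    }
}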
In the end, I highly doubt it will make much difference in speed -
here are several reasons why...
1. with "fixed" size terms, the additional IO (larger pages) probably
offsets a lot of the random access benefit. This is why "compressed"
disks on a fast machine (CPU) are often faster than "uncompressed" -
more data is read during every IO access.
2. with a reopen, only new segments are "read", and since a new
segment was just written, it is most likely already in the disk
cache, so the reopen penalty is negligible (especially if the term
index "skip to" is written last). See the reopen sketch after this
list.
3. If the reopen is after an optimize - when the OS cache has
probably been obliterated - the warm-up time is going to be similar
in most cases anyway, since the "index" pages will also not be in
core (in the case of memory mapped files). Again, writing the
"skip to" last can help with this.
Just because a file is memory mapped does not mean its pages will
have a greater likelihood of being in the cache. Locality of
reference is going to control this, just as frequency and recency of
access control what stays in the OS disk cache. Also, most OSs will
take real memory away from a process's virtual address space and add
it to the disk cache if the process is doing lots of IO.
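A small sketch of that point using plain java.nio (the file name is a
placeholder): mapping a file only establishes the mapping; pages are
faulted in when you actually touch them, and they compete for memory
under the same locality/eviction rules as the disk cache.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapResidency {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("_1.tii", "r"); // placeholder file
        FileChannel channel = raf.getChannel();
        MappedByteBuffer mapped =
            channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

        // Mapping alone does not bring the pages into core; isLoaded()
        // is only a hint, and load() is only a request to the OS.
        System.out.println("resident after map? " + mapped.isLoaded());

        // It is touching the data (e.g. during a binary search) that
        // pages it in - and it can be evicted again like any cached page.
        if (mapped.capacity() > 0) {
            mapped.get(0);
        }

        channel.close();
        raf.close();
    }
}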
If you have a memory mapped "term index", you are still going to need
to perform a binary search to find the correct term "page" (sketched
below), and after an optimize the visited pages will not be in the
cache (or in core).
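Here is roughly what that lookup looks like (a sketch assuming the
same hypothetical fixed-width slot layout as above, not the real .tii
format): every binary-search probe touches a different region of the
mapped buffer, and after an optimize each of those touches can be a
page fault.

import java.nio.MappedByteBuffer;
import java.nio.charset.StandardCharsets;

class MappedTermIndexSearch {
    static final int MAX_TERM_BYTES = 64;
    static final int SLOT_SIZE = MAX_TERM_BYTES + 8;

    // Returns the file pointer stored with the term, or -1 if absent.
    static long lookup(MappedByteBuffer index, int numTerms, String target) {
        byte[] key = target.getBytes(StandardCharsets.UTF_8);
        int lo = 0, hi = numTerms - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(index, mid, key); // each probe may fault in a page
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return index.getLong(mid * SLOT_SIZE + MAX_TERM_BYTES);
        }
        return -1;
    }

    // Compare the zero-padded term stored in the slot against the key.
    static int compareAt(MappedByteBuffer index, int slot, byte[] key) {
        int base = slot * SLOT_SIZE;
        for (int i = 0; i < MAX_TERM_BYTES; i++) {
            int stored = index.get(base + i) & 0xFF;      // 0 = padding
            int wanted = i < key.length ? key[i] & 0xFF : 0;
            if (stored != wanted) return stored - wanted;
        }
        return 0;
    }
}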
On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote:
On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote:
Is there something that I am missing?
Yes.
I see lots of references to using "memory mapped" files to "dramatically" improve performance.
There have been substantial discussions about this design in JIRA,
notably LUCENE-1458.
The "dramatic" improvement is WRT to opening/reopening an IndexReader.
Presently in both KS and Lucene, certain data structures have to be
read at IndexReader startup and unpacked into process memory -- in
particular, the term dictionary index and sort caches. If those data
structures can be represented by a memory mapped file rather than
built up from scratch, we save big.
Marvin Humphrey