Thinking about this some more, you could use fixed-length pages for the term index, with a page header containing a count of entries, and use key compression to avoid a constant entry size.
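
As a sketch (the page layout and class names here are just illustrative - none of this is existing Lucene code):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical page encoder: a fixed-size page whose header holds the
// entry count, with prefix ("key") compression so entries need not be
// a constant size.
class TermIndexPageWriter {
    static final int PAGE_SIZE = 4096; // fixed page length (assumed)

    static byte[] encodePage(String[] sortedTerms) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(PAGE_SIZE);
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(sortedTerms.length);          // page header: entry count
        String prev = "";
        for (String term : sortedTerms) {
            int shared = sharedPrefixLen(prev, term);
            byte[] suffix = term.substring(shared).getBytes("UTF-8");
            out.writeShort(shared);                // chars shared with previous term
            out.writeShort(suffix.length);         // length of the new suffix
            out.write(suffix);                     // only the differing suffix
            prev = term;
        }
        return bytes.toByteArray();                // caller pads to PAGE_SIZE
    }

    static int sharedPrefixLen(String a, String b) {
        int i = 0, max = Math.min(a.length(), b.length());
        while (i < max && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}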

The problem with this is that you still have to decode the entries, which slows processing, since a simple binary search within the page is not possible.

But, if you also add a 'least term' and 'greatest term' to the page header (you can avoid storing those entries twice as well), you can perform a binary search of the term index much faster. You only need to decode the single index page that (possibly) contains the desired entry.
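
A minimal sketch of that two-level lookup, with hypothetical types:

// Binary-search the per-page (least, greatest) headers to pick the one
// candidate page; only that page is then decoded and scanned.
class PageHeader {
    String leastTerm, greatestTerm;
    long filePointer; // offset of the page in the index file
}

class TermIndexSearch {
    static int findPage(PageHeader[] headers, String term) {
        int lo = 0, hi = headers.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (term.compareTo(headers[mid].leastTerm) < 0) hi = mid - 1;
            else if (term.compareTo(headers[mid].greatestTerm) > 0) lo = mid + 1;
            else return mid;   // the only page that can contain the term
        }
        return -1;             // term falls in a gap between pages: not present
    }
}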

If you are doing a prefix/range search, you will still end up decoding lots of pages...

This is why databases have their own page cache, and usually cache the decoded form (for index pages) for faster processing - at the expense of higher memory usage. Data pages are usually not cached in decoded/uncompressed form. In most cases the database vendor will recommend bypassing the OS page cache on the database server and allocating all of the memory to the database process.
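
In Java, a bare-bones decoded-page cache can lean on LinkedHashMap's access-order mode for LRU eviction - a toy version of what a real database page cache does:

import java.util.LinkedHashMap;
import java.util.Map;

// DecodedPage is a placeholder for whatever the decoded entry array
// looks like; the cache maps page file-offsets to decoded pages.
class DecodedPage { /* decoded term entries for one page */ }

class DecodedPageCache extends LinkedHashMap<Long, DecodedPage> {
    private final int maxPages;

    DecodedPageCache(int maxPages) {
        super(16, 0.75f, true);   // true = access order, i.e. LRU
        this.maxPages = maxPages;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, DecodedPage> eldest) {
        return size() > maxPages; // evict the least-recently-used decoded page
    }
}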

You may be able to avoid some of the warm-up of an index using memory-mapped files, but with proper ordering of the writing of the index, it probably isn't necessary. Beyond that, processing the term index directly using NIO does not appear likely to be faster than using an in-process cache of the term index (similar to the 'skip to' memory index now).
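
For reference, mapping the term index with NIO is roughly the following (the method and file path are the caller's choice). Note that map() itself reads nothing - pages are faulted in as they are touched, which is exactly why mapping alone doesn't eliminate warm-up:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class TermIndexMapper {
    static MappedByteBuffer mapTermIndex(String path) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        FileChannel channel = raf.getChannel();
        MappedByteBuffer buf =
            channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        raf.close(); // the mapping stays valid after the channel is closed
        return buf;
    }
}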

The BEST approach is probably to have the index writer build the in-memory 'skip to' structure as it writes the segment, and then include this in the segment during the reopen - no warming required! As long as the reader and writer are in the same process, it will be a winner!
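
Something like this, with all class names being hypothetical stand-ins rather than current Lucene APIs:

// The writer builds the in-memory "skip to" structure while writing
// the segment; a same-process reopen hands it straight to the reader.
class TermIndex {
    void maybeAddIndexEntry(String term, long filePointer) {
        // e.g. record every 128th term, as the term-index interval does
    }
}

class SegmentReader {
    SegmentReader(TermIndex prebuilt) { /* reuse it - nothing to warm */ }
}

class SegmentWriter {
    private final TermIndex skipTo = new TermIndex();

    void writeTerm(String term, long filePointer) {
        // ... write the term to the segment files ...
        skipTo.maybeAddIndexEntry(term, filePointer); // build index as we go
    }

    SegmentReader reopen() {
        // Same process: the reader reuses the structure the writer
        // already built, so the reopen needs no disk reads at all.
        return new SegmentReader(skipTo);
    }
}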

On Dec 23, 2008, at 11:02 PM, robert engels wrote:

It seems doubtful you will be able to do this without increasing the index size dramatically, since the data will need to be stored "unpacked" in order to allow random access, yet the terms are variable length - forcing every term to be padded to a fixed maximum size.
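
To make the cost concrete (the 32-byte slot and the 4x figure are made-up numbers, just for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;

// Random access requires fixed-width slots, so every term is padded to
// the longest term's size. If the average term is ~8 bytes and the slot
// is 32, the file is roughly 4x larger than a packed encoding.
class UnpackedTermFile {
    static final int SLOT_SIZE = 32; // must cover the longest term (assumed)

    static String readTerm(RandomAccessFile file, long slot) throws IOException {
        byte[] buf = new byte[SLOT_SIZE];
        file.seek(slot * SLOT_SIZE);       // the payoff: one multiply, one seek
        file.readFully(buf);
        return new String(buf, "UTF-8").trim(); // strip padding (sketch only)
    }
}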

In the end I highly doubt it will make much difference in speed - here are several reasons why...

1. with "fixed" size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why "compressed" disks on a fast machine (CPU) are often faster than "uncompressed" - more data is read during every IO access.

2. With a reopen, only new segments are "read", and since a new segment was just written, it is most likely already in the disk cache, so the reopen penalty is negligible (especially if the term index "skip to" is written last).

3. If the reopen is after an optimize - when the OS cache has probably been obliterated - the warm-up time is going to be similar in most cases anyway, since the "index" pages will also not be in core (in the case of memory-mapped files). Again, writing the "skip to" last can help with this.

Just because a file is memory mapped does not mean its pages will have a greater likelihood of being in the cache. Locality of reference is going to control this, just as access frequency and recency control residency in the OS disk cache. Also, most OSs will take real memory from the virtual address space and add it to the disk cache if the process is doing lots of IO.

If you have a memory-mapped "term index", you are still going to need to perform a binary search to find the correct term "page", and after an optimize the visited pages will not be in the cache (or in core).
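
That search pattern, sketched against a MappedByteBuffer with assumed fixed-size slots (the target padded the same way), shows why: each probe lands on a widely separated position that may fault a page in from disk:

import java.nio.MappedByteBuffer;

class MappedTermSearch {
    static final int SLOT_SIZE = 32; // assumed fixed-width slots

    static int search(MappedByteBuffer index, int numSlots, byte[] target) {
        int lo = 0, hi = numSlots - 1;
        byte[] probe = new byte[SLOT_SIZE];
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            index.position(mid * SLOT_SIZE); // likely a different OS page each probe
            index.get(probe);
            int cmp = compareBytes(probe, target);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -(lo + 1); // not found
    }

    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}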

On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote:

On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote:
Is there something that I am missing?

Yes.

I see lots of references to using "memory mapped" files to "dramatically" improve performance.

There have been substantial discussions about this design in JIRA,
notably LUCENE-1458.

The "dramatic" improvement is WRT to opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory mapped file rather than built up from scratch, we save
big.

Marvin Humphrey

