On Wed, Jun 10, 2009 at 9:24 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> I read over the LUCENE-1458 comments again. Interesting. I think
> the most compelling argument is that the various files we're
> normally loading into the heap are, after merging, in the IO
> cache. If we can simply reuse the IO cache rather then allocate
> a bunch of redundant arrays in heap, we could be better off? I
> think this is very compelling for field caches, delDocs, and
> bitsets that are tied to segments and loaded after each merge.

The OS doesn't have enough information to "know" which data structures
are important to Lucene (must stay hot) and which are less so. Its
blind LRU approach is often a poor policy (e.g., for the terms dict,
where a binary search could suddenly need to visit a random, rarely
accessed page).

For example, after merging, all the segments we just *read* from will
also be hot, having flushed other important pages out of the IO cache,
which is very much not what we want. From C, and per-OS, you can
inform the OS that it should not cache the bytes read from the file,
but from Java we just can't control that.

> I think it's possible to write some basic benchmarks to test a
> byte[] BitVector vs. a MappedByteBuffer BitVector and see what
> happens.

Yes, but this is challenging to test properly. On systems with plenty
of RAM, the approaches should be similarly fast. On systems starved
for RAM, both approaches should thrash miserably. It's the cases in
between that we need to test for.

> The other potentially interesting angle here is in regards to
> realtime updates, where we can implement a MMaped page type of
> system so blocks of this stuff can be updated in near realtime,
> directly in the MMaped space (similar to how in heap land with
> LUCENE-1526 we're looking at breaking up the byte[] into a
> byte[][]).

But carrying such updates via RAM, like we do now for deletions,
should generally be more performant (you never have to put the changes
on disk).

> Also if we assume data is MMaped I don't think it matters as much if
> the updates on disk are not in sequence? (Whereas today we try
> to keep all our files sequentially readable optimized). Of
> course I could be completely wrong. :)

Well... locality is still important. Under the hood, mmap on a page
miss must hit the disk.
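
For what it's worth, a starting point for the byte[] vs.
MappedByteBuffer comparison might look roughly like the sketch below.
Everything in it is made up for illustration: the BitVectorSketch
class, the Bits interface, heapBits, mappedBits and countSet are
hypothetical names, not Lucene's BitVector API, and the sizes are
arbitrary. It only times random lookups on a machine with plenty of
RAM, so it says nothing yet about the in-between cases under memory
pressure, which are the ones that matter.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

final class BitVectorSketch {

    /** Hypothetical read-only bit set; NOT Lucene's BitVector API. */
    interface Bits {
        boolean get(int index);
    }

    /** Bits backed by a heap-resident byte[]. */
    static Bits heapBits(byte[] bytes) {
        return i -> (bytes[i >> 3] & (1 << (i & 7))) != 0;
    }

    /** Bits backed by a read-only mmap of the file (served from the OS IO cache). */
    static Bits mappedBits(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return i -> (buf.get(i >> 3) & (1 << (i & 7))) != 0;
        }
    }

    /** Counts how many of the probed positions are set. */
    static int countSet(Bits bits, int[] probes) {
        int n = 0;
        for (int p : probes) {
            if (bits.get(p)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        int numBits = 1 << 24;             // arbitrary size, 2 MB of bits
        Random random = new Random(42);
        byte[] bytes = new byte[numBits >> 3];
        random.nextBytes(bytes);

        Path file = Files.createTempFile("bits", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, bytes);

        // Random access pattern, the worst case for a page-based cache.
        int[] probes = new int[1_000_000];
        for (int i = 0; i < probes.length; i++) {
            probes[i] = random.nextInt(numBits);
        }

        Bits heap = heapBits(bytes);
        Bits mapped = mappedBits(file);

        // Naive timing loop; the first rounds are dominated by JIT warm-up.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            int a = countSet(heap, probes);
            long t1 = System.nanoTime();
            int b = countSet(mapped, probes);
            long t2 = System.nanoTime();
            System.out.printf("round %d: heap %d us, mmap %d us (set bits: %d / %d)%n",
                    round, (t1 - t0) / 1000, (t2 - t1) / 1000, a, b);
        }
    }
}
```

To say anything about the interesting middle ground, you'd have to run
this with a bit set much larger than shown here while something else
competes for the IO cache (e.g. a concurrent merge), and that is the
hard part to set up reproducibly.
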

Mike