Re: Lucene memory usage

Jason Rutherglen Thu, 11 Jun 2009 12:40:08 -0700

Maybe we can put together our requested IO operations and submit them for
inclusion in NIO for Java 7?  http://openjdk.java.net/projects/nio/


On Thu, Jun 11, 2009 at 12:21 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> Makes sense.
>
> Currently MMapDirectory doesn't write using mapped byte buffers,
> would the memory management of the OS behave differently if we
> were writing to the MMapped bytebuffers as opposed to writing to
> an RAF (like with FSDir)?
>
> > Its blind LRU approach is often a poor policy (eg for terms
> dict, where a binary search could easily suddenly need to visit
> a random rarely accessed page).
>
> Agreed it's not the best for termDict.
>
> > Well... locality is still important. Under the hood, mmap on a
> page miss must hit the disk.
>
> Maybe this is where MappedByteBuffer.load, as Earwin has
> mentioned, comes in handy?
>
> But yeah, we can't do anything with this unless we had a JNI
> library that interacts more directly with the IO system
> (allowing us to configure whether IO is cached etc), which
> perhaps exists or could exist in the future (or Java7?).
>
>
>
> On Thu, Jun 11, 2009 at 2:43 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> On Wed, Jun 10, 2009 at 9:24 PM, Jason
>> Rutherglen<jason.rutherg...@gmail.com> wrote:
>> > I read over the LUCENE-1458 comments again. Interesting. I think
>> > the most compelling argument is that the various files we're
>> > normally loading into the heap are, after merging, in the IO
>> > cache. If we can simply reuse the IO cache rather than allocate
>> > a bunch of redundant arrays in heap, we could be better off? I
>> > think this is very compelling for field caches, delDocs, and
>> > bitsets that are tied to segments and loaded after each merge.
>>
>> The OS doesn't have enough information to "know" what data structures
>> are important to Lucene (must stay hot) and which are less so.  Its
>> blind LRU approach is often a poor policy (eg for terms dict, where a
>> binary search could easily suddenly need to visit a random rarely
>> accessed page).
>>
>> For example, after merging, all the segments we just *read* from will
>> also be hot, having flushed out other important pages from the IO
>> cache, which is very much not what we want to do.  From C, and per-OS,
>> you can inform the OS that it should not cache the bytes read from the
>> file, but from Java we just can't control that.
>>
>> > I think it's possible to write some basic benchmarks to test a
>> > byte[] BitVector vs. a MappedByteBuffer BitVector and see what
>> > happens.
>>
>> Yes, but this is challenging to test properly.  On systems with plenty
>> of RAM, the approaches should be similarly fast.  On systems starved
>> for RAM, both approaches should thrash miserably.  It's the cases in
>> between that we need to test for.
>>
>> > The other potentially interesting angle here is in regards to
>> > realtime updates, where we can implement a MMaped page type of
>> > system so blocks of this stuff can be updated in near realtime,
>> > directly in the MMaped space (similar to how in heap land with
>> > LUCENE-1526 we're looking at breaking up the byte[] into a
>> > byte[][]).
>>
>> But carrying such updates via RAM, like we do now for deletions,
>> should generally be more performant (you never have to put the changes
>> on disk).
>>
>> > Also if we assume data is MMaped I don't think it matters as much if
>> > the updates on disk are not in sequence? (Whereas today we try
>> > to keep all our files optimized for sequential reads). Of
>> > course I could be completely wrong. :)
>>
>> Well... locality is still important.  Under the hood, mmap on a page
>> miss must hit the disk.
>>
>> Mike
>>
>>
>
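
For reference, a minimal sketch of the two variants compared in the thread
above: a bit vector copied into a byte[] on the heap, and the same bits read
through a MappedByteBuffer backed by the OS IO cache. This is not Lucene's
BitVector; the class name BitVectorSketch, the Bits interface, and the helper
methods are all hypothetical and exist only for illustration. The only library
calls assumed are the standard java.nio ones (FileChannel.map,
MappedByteBuffer.load, MappedByteBuffer.get). Note that load() is only a
best-effort request to bring the pages into memory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not Lucene code.
final class BitVectorSketch {

    /** Minimal read-only view shared by both variants (illustrative only). */
    interface Bits {
        boolean get(int index);
    }

    /** Heap variant: the whole file is copied into a byte[] on the Java heap. */
    static Bits heapBits(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return index -> (bytes[index >> 3] & (1 << (index & 7))) != 0;
    }

    /** Mapped variant: reads are served from the OS page cache, with no heap copy. */
    static Bits mappedBits(Path file, boolean preload) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            if (preload) {
                buf.load(); // best-effort hint to fault the pages in up front
            }
            return index -> (buf.get(index >> 3) & (1 << (index & 7))) != 0;
        }
    }

    /** Counts set bits by probing every index through the Bits view. */
    static int countSet(Bits bits, int numBits) {
        int n = 0;
        for (int i = 0; i < numBits; i++) {
            if (bits.get(i)) n++;
        }
        return n;
    }

    /** Writes a vector with every third bit set, then checks both variants agree. */
    static boolean variantsAgree(int numBits) {
        try {
            byte[] data = new byte[(numBits + 7) / 8];
            for (int i = 0; i < numBits; i += 3) {
                data[i >> 3] |= (byte) (1 << (i & 7));
            }
            Path file = Files.createTempFile("bitvector", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, data);

            Bits heap = heapBits(file);
            Bits mapped = mappedBits(file, true);
            for (int i = 0; i < numBits; i++) {
                if (heap.get(i) != mapped.get(i)) return false;
            }
            return countSet(heap, numBits) == (numBits + 2) / 3;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("variants agree: " + variantsAgree(1_000_000));
    }
}
```

This only checks that the two variants return the same bits; it says nothing
about speed. As noted in the thread, a meaningful benchmark would have to run
on a machine where RAM is neither plentiful nor exhausted, which this sketch
does not attempt.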
