I think this approach would make sense only in certain situations.  If
you're running different things with different memory requirements on the
same server, then the OS is probably the most efficient judge of what needs
to be in RAM.  However, if (like us) the server is running a big Java app
using Lucene and nothing else, then maintaining your own cache seems useful,
for the reasons I mentioned earlier.

Mike, the question you raise is whether (or to what degree) the OS will swap
out app memory in favor of IO cache.  I don't know anything about how the
Linux kernel makes those decisions, but I guess I had hoped that (regardless
of the swappiness setting) it would be less likely to swap out application
memory for IO cache than to replace some cached IO data with some
different cached IO data.  The latter case is what kills Lucene performance
when you've got a lot of index data in the IO cache and a file copy or some
other operation replaces it all with something else: the OS has no way of
knowing that some IO cache is more desirable long-term than other IO cache.
The former case (swapping app for IO cache) makes sense, I suppose, if the
app memory hasn't been used in a long time, but with an LRU cache you should
be hitting those pages pretty frequently by definition.  But if it does swap
out your Java cache for something else, you're probably no worse off than
before, right?  In this case you have to hit the disk to fault in the
paged-out cache; in the original case you have to hit the disk to read the
index data that's not in IO cache.
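
(For what it's worth, by "LRU cache" I really just mean an access-ordered
map of byte[] blocks, along the lines of the rough sketch below.  The names
and the block-count sizing are made up, and a real version would need
synchronization and smarter sizing.)

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Rough sketch only: an access-ordered map of index blocks, keyed on
    // filename + block start offset.  Every hit touches the byte[] it hands
    // back, which is why I'd expect the hot entries' pages to stay warm.
    // Not thread-safe as written.
    class BlockCache extends LinkedHashMap<String, byte[]> {
        private final int maxBlocks;

        BlockCache(int maxBlocks) {
            super(16, 0.75f, true);  // accessOrder = true -> LRU iteration order
            this.maxBlocks = maxBlocks;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > maxBlocks;  // evict the least-recently-used block
        }
    }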

Anyway, the interactions between these things (virtual memory, IO cache,
disk, JVM, garbage collection, etc.) are complex, and so the optimal
configuration is very usage-dependent.  The current Lucene behavior seems to
be the most flexible.  When/if I get a chance to try the Java caching for
our situation I'll report the results.
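
In the meantime, here's roughly the read path I'm picturing, using a cache
like the one sketched above.  Again, these are made-up names, with a plain
RandomAccessFile standing in for the FSDirectory/FSIndexInput subclassing
mentioned below, so treat it as an illustration of the idea rather than
working Lucene code:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Map;

    // Rough sketch: consult the Java-level cache before touching the disk.
    // Blocks are keyed on filename + block start, matching the 1k chunks
    // that BufferedIndexInput already reads.
    class CachingBlockReader {
        private static final int BLOCK_SIZE = 1024;
        private final Map<String, byte[]> cache;  // e.g. the BlockCache above

        CachingBlockReader(Map<String, byte[]> cache) {
            this.cache = cache;
        }

        byte[] readBlock(String fileName, long position) throws IOException {
            long blockStart = (position / BLOCK_SIZE) * BLOCK_SIZE;
            String key = fileName + ":" + blockStart;
            byte[] block = cache.get(key);
            if (block == null) {
                block = readFromDisk(fileName, blockStart);  // miss: hit the disk
                cache.put(key, block);
            }
            return block;  // hit: no IO, no reliance on the OS cache
        }

        // A plain file read standing in for whatever the real FSIndexInput
        // does underneath.  The last block of a file comes back zero-padded;
        // a real version would track the valid length.
        private byte[] readFromDisk(String fileName, long blockStart)
                throws IOException {
            RandomAccessFile raf = new RandomAccessFile(fileName, "r");
            try {
                byte[] block = new byte[BLOCK_SIZE];
                int len = (int) Math.min(BLOCK_SIZE, raf.length() - blockStart);
                raf.seek(blockStart);
                raf.readFully(block, 0, len);
                return block;
            } finally {
                raf.close();
            }
        }
    }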

On Wed, Jul 22, 2009 at 12:37 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> I think it's a neat idea!
>
> But you are in fact fighting the OS so I'm not sure how well this'll
> work in practice.
>
> EG the OS will happily swap out pages from your process if it thinks
> you're not using them, so it'd easily swap out your cache in favor of
> its own IO cache (this is the "swappiness" configuration on Linux),
> which would then kill performance (take a page hit when you finally
> did need to use your cache).  In C (possibly requiring root) you could
> wire the pages, but we can't do that from javaland, so it's already
> not a fair fight.
>
> Mike
>
> > On Wed, Jul 22, 2009 at 6:21 PM, Nigel <nigelspl...@gmail.com> wrote:
> >>
> >> In discussions of Lucene search performance, the importance of OS
> >> caching of index data is frequently mentioned.  The typical
> >> recommendation is to keep plenty of unallocated RAM available (e.g.
> >> don't gobble it all up with your JVM heap) and try to avoid large I/O
> >> operations that would purge the OS cache.
> >>
> >> I'm curious if anyone has thought about (or even tried) caching the
> >> low-level index data in Java, rather than in the OS.  For example, at
> >> the IndexInput level there could be an LRU cache of byte[] blocks,
> >> similar to how a RDBMS caches index pages.  (Conveniently,
> >> BufferedIndexInput already reads in 1k chunks.) You would reverse the
> >> advice above and instead make your JVM heap as large as possible (or
> >> at least large enough to achieve a desired speed/space tradeoff).
> >>
> >> This approach seems like it would have some advantages:
> >>
> >> - Explicit control over how much you want cached (adjust your JVM
> >> heap and cache settings as desired)
> >> - Cached index data won't be purged by the OS doing other things
> >> - Index warming might be faster, or at least more predictable
> >>
> >> The obvious disadvantage for some situations is that more RAM would
> >> now be tied up by the JVM, rather than managed dynamically by the OS.
> >>
> >> Any thoughts?  It seems like this would be pretty easy to implement
> >> (subclass FSDirectory, return subclass of FSIndexInput that checks
> >> the cache before reading, cache keyed on filename + position), but
> >> maybe I'm oversimplifying, and for that matter a similar
> >> implementation may already exist somewhere for all I know.
