In discussions of Lucene search performance, the importance of OS caching of
index data is frequently mentioned.  The typical recommendation is to keep
plenty of unallocated RAM available (e.g. don't gobble it all up with your
JVM heap) and try to avoid large I/O operations that would purge the OS
cache.

I'm curious if anyone has thought about (or even tried) caching the
low-level index data in Java, rather than in the OS.  For example, at the
IndexInput level there could be an LRU cache of byte[] blocks, similar to
how an RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
reads in 1k chunks.) You would reverse the advice above and instead make
your JVM heap as large as possible (or at least large enough to achieve a
desired speed/space tradeoff).
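To make the idea concrete, here's a rough sketch of such an LRU block cache (the class and key format are mine, not Lucene's), built on LinkedHashMap's access-order mode, with blocks the size of BufferedIndexInput's 1k buffers:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache of fixed-size index blocks, keyed on
// "filename:blockNumber".  Not thread-safe; a real cache would
// need synchronization and a byte-budget rather than a block count.
public class BlockCache {
    public static final int BLOCK_SIZE = 1024;  // matches BufferedIndexInput's default

    private final LinkedHashMap<String, byte[]> map;

    public BlockCache(final int maxBlocks) {
        // accessOrder=true makes iteration order least-recently-used first;
        // removeEldestEntry evicts the LRU block once capacity is exceeded.
        this.map = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    public byte[] get(String file, long blockNumber) {
        return map.get(file + ":" + blockNumber);
    }

    public void put(String file, long blockNumber, byte[] block) {
        map.put(file + ":" + blockNumber, block);
    }

    public int size() {
        return map.size();
    }
}
```

Sizing maxBlocks against the JVM heap is exactly the explicit speed/space knob described above.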

This approach seems like it would have some advantages:

- Explicit control over how much you want cached (adjust your JVM heap and
cache settings as desired)
- Cached index data won't be purged by the OS doing other things
- Index warming might be faster, or at least more predictable

The obvious disadvantage for some situations is that more RAM would now be
tied up by the JVM, rather than managed dynamically by the OS.

Any thoughts?  It seems like it would be pretty easy to implement
(subclass FSDirectory, return a subclass of FSIndexInput that checks the
cache before reading, with the cache keyed on filename + position), but
maybe I'm oversimplifying, and for all I know a similar implementation
may already exist somewhere.
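For illustration, here's a standalone sketch of that check-the-cache-first read path (all names mine; a real version would subclass BufferedIndexInput and override its refill/read methods rather than wrap RandomAccessFile):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a caching FSIndexInput: reads go through an
// LRU map of 1k blocks keyed on filename + block position, and only fall
// through to the file (i.e. the OS) on a miss.
public class CachedReader {
    static final int BLOCK_SIZE = 1024;

    private final RandomAccessFile file;
    private final String name;
    private final Map<String, byte[]> cache;
    public int misses = 0;  // actual file reads performed, for illustration

    public CachedReader(String name, final int maxBlocks) throws IOException {
        this.name = name;
        this.file = new RandomAccessFile(name, "r");
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                return size() > maxBlocks;  // LRU eviction
            }
        };
    }

    // Cache key is filename + block position, as suggested above.
    public byte readByte(long pos) throws IOException {
        long block = pos / BLOCK_SIZE;
        String key = name + "@" + block;
        byte[] buf = cache.get(key);
        if (buf == null) {               // miss: go to the OS
            misses++;
            buf = new byte[BLOCK_SIZE];
            file.seek(block * BLOCK_SIZE);
            file.read(buf);              // short read near EOF leaves zero padding
            cache.put(key, buf);
        }
        return buf[(int) (pos % BLOCK_SIZE)];
    }

    public void close() throws IOException {
        file.close();
    }
}
```

Repeated reads within the same block then never touch the file, which is the behavior the proposal is after.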

Thanks,
Chris
