In discussions of Lucene search performance, the importance of OS caching of index data is frequently mentioned. The typical recommendation is to keep plenty of unallocated RAM available (e.g. don't gobble it all up with your JVM heap) and try to avoid large I/O operations that would purge the OS cache.
I'm curious whether anyone has thought about (or even tried) caching the low-level index data in Java, rather than in the OS. For example, at the IndexInput level there could be an LRU cache of byte[] blocks, similar to how an RDBMS caches index pages. (Conveniently, BufferedIndexInput already reads in 1k chunks.) You would reverse the advice above and instead make your JVM heap as large as possible (or at least large enough to achieve the speed/space tradeoff you want).

This approach seems like it would have some advantages:
- Explicit control over how much is cached (adjust your JVM heap and cache settings as desired)
- Cached index data won't be purged by the OS doing other things
- Index warming might be faster, or at least more predictable

The obvious disadvantage in some situations is that more RAM would be tied up by the JVM, rather than managed dynamically by the OS.

Any thoughts? It seems like this would be fairly easy to implement (subclass FSDirectory, return a subclass of FSIndexInput that checks the cache before reading, with the cache keyed on filename + position), but maybe I'm oversimplifying, and for that matter a similar implementation may already exist somewhere for all I know.

Thanks,
Chris
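P.S. To make the idea concrete, here's a rough standalone sketch of the cache piece, with nothing Lucene-specific in it. The class and method names (BlockCache, BlockKey) are just placeholders I made up, not Lucene APIs; a real version would sit behind a subclass of FSDirectory/FSIndexInput that consults the cache before hitting disk. It just uses LinkedHashMap's access-order mode to get LRU eviction, keyed on filename + block start offset:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an LRU cache of byte[] blocks keyed on
// (file name, block start offset). Not part of Lucene.
public class BlockCache {
    static final class BlockKey {
        final String fileName;
        final long blockStart;
        BlockKey(String fileName, long blockStart) {
            this.fileName = fileName;
            this.blockStart = blockStart;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof BlockKey)) return false;
            BlockKey k = (BlockKey) o;
            return blockStart == k.blockStart && fileName.equals(k.fileName);
        }
        @Override public int hashCode() {
            return fileName.hashCode() * 31 + Long.hashCode(blockStart);
        }
    }

    private final int maxBlocks;
    private final Map<BlockKey, byte[]> blocks;

    public BlockCache(int maxBlocks) {
        this.maxBlocks = maxBlocks;
        // accessOrder=true makes LinkedHashMap iterate least-recently-used
        // first, so evicting the eldest entry gives LRU behavior.
        this.blocks = new LinkedHashMap<BlockKey, byte[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<BlockKey, byte[]> eldest) {
                return size() > BlockCache.this.maxBlocks;
            }
        };
    }

    // Returns the cached block, or null on a miss (caller then reads
    // the block from disk and put()s it).
    public synchronized byte[] get(String fileName, long blockStart) {
        return blocks.get(new BlockKey(fileName, blockStart));
    }

    public synchronized void put(String fileName, long blockStart, byte[] block) {
        blocks.put(new BlockKey(fileName, blockStart), block);
    }

    public synchronized int size() {
        return blocks.size();
    }
}
```

The IndexInput subclass's readInternal would compute which 1k block(s) cover the requested file position, try get() for each, and fall back to the real file read (then put()) on a miss. Sizing maxBlocks is where the explicit heap/cache tradeoff from above would be tuned.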