On Mon, Jun 29, 2009 at 6:28 AM, Michael McCandless < luc...@mikemccandless.com> wrote:
> On Sun, Jun 28, 2009 at 9:08 PM, Nigel<nigelspl...@gmail.com> wrote: > >> Unfortunately the TermInfos must still be hit to look up the > >> freq/proxOffset in the postings files. > > > > But for that data you only have to hit the TermInfos for the terms you're > > searching, correct? So, assuming that there are vastly more terms in the > > index than you ever actually use in a query, we could rely on the LRU > cache > > to keep the queried TermInfos around, rather than loading all of them > > up-front. This was a hypothesis based on some tracing through the code > but > > not a lot of knowledge of Lucene internals, so please steer me back to > > reality if necessary. (-: > > Right, it's only the terms in your query that need to be looked up. > > There's already an LRU cache for the Term -> TermInfos lookup, in > TermInfosReader (hard wired to size 1024). It was created as a "short > term" cache so that queries that look up the same term twice (eg once > for idf and once to get freq/prox postings position), would result in > only one real lookup. You could try privately experimenting w/ that > value to see if you can get a reasonable it rate? Exactly, the TermInfosReader cache is what I was thinking of. Thanks for the confirmation; I'll give that a try. Lucene only loads the "indexed" terms (= every 128th) into RAM, by default. Ah, I was confused by the index divisor being 1 by default: I thought it meant that all terms were being loaded. I see now in SegmentTermEnum that the every-128th behavior is implemented at a lower level. But I'm even more confused about why we have so many terms in memory. A heap dump shows over 270 million TermInfos, so if that's only 128th of the total then we REALLY have a lot of terms. (-: We do have a lot of docs (about 250 million), and we do have a couple unique per-document values, but even so I can't see how we could get to 270 million x 128 terms. (The heap dump numbers are stable across the index close-and-reopen cycle, so I don't think we're leaking.) Thanks, Chris