Re: Java caching of low-level index data?

eks dev Wed, 22 Jul 2009 12:39:58 -0700

> Part of the challenge here is what metric is really important.
Sure, it depends on who you ask :) Lucene is so popular that you can find
almost every pattern we could come up with.


Funny, I had to deal with a similar situation. The simplest solution was to
set up a warm-up with constructed Queries (built from high-frequency terms)
that runs well before users start shooting... everybody was happy, both the
latency of user requests and the OS. Even funnier, we still do it today with
a RAMDisk, not to fight the OS for RAM, but to pre-populate our own
app-specific caches after updates/restarts. A good warm-up tackles a lot of
these problems and is not difficult to do.
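
A minimal sketch of that kind of warm-up, just to illustrate the idea. It is
plain Java with no Lucene dependency; the names WarmUpSketch, pickWarmUpTerms
and warmUp, and the search callback, are made up for this example, and the
term frequencies are assumed to come from wherever you already keep index
statistics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical helper: replay searches for the most frequent terms before real traffic arrives. */
final class WarmUpSketch {

    /** Returns up to maxTerms terms, highest document frequency first. */
    static List<String> pickWarmUpTerms(Map<String, Integer> docFreqByTerm, int maxTerms) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqByTerm.entrySet());
        // Sort by descending frequency; ties are broken alphabetically so the order is deterministic.
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (terms.size() >= maxTerms) {
                break;
            }
            terms.add(e.getKey());
        }
        return terms;
    }

    /** Runs the search callback once per term and returns how many warm-up searches completed. */
    static int warmUp(List<String> terms, Consumer<String> search) {
        int completed = 0;
        for (String term : terms) {
            try {
                // Results are thrown away; the side effect (hot OS and application caches) is the point.
                search.accept(term);
                completed++;
            } catch (RuntimeException e) {
                // A failed warm-up query should not block startup, so it is skipped.
            }
        }
        return completed;
    }
}
```

In a real application the callback would wrap whatever issues a query against
your searcher, for example something along the lines of
searcher.search(new TermQuery(new Term(field, term)), 10), and the whole thing
would run once after each update/restart, before the node takes user traffic.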





----- Original Message ----
> From: Michael McCandless <[email protected]>
> To: [email protected]
> Sent: Wednesday, 22 July, 2009 21:03:00
> Subject: Re: Java caching of low-level index data?
> 
> Part of the challenge here is what metric is really important.
> 
> Eg, as extreme example, imagine a machine that does searching but also
> does other things.  The search is not heavily used; in fact people
> only run searches from 9 to 5.  So overnight, the OS notices the
> search isn't using the RAM at all, and it happily swaps it out and gives
> it to other processes, uses it for IO cache, etc.
> 
> So then, the first few searches at 9 AM the next day are really slow
> as everything gets swapped back in.
> 
> From the OS's standpoint, with the goal of maximizing overall
> efficient utilization of the resources, swapping the pages out made
> sense, because all these processes overnight ran much more
> efficiently.  A few sluggish morning searches was a small price to
> pay.
> 
> But if consistency of search latency is important, you don't want the
> OS to ever do that, and you really need to tune swappiness down, wire
> the pages, pre-load your own caches, etc.
> 
> Mike
> 
> On Wed, Jul 22, 2009 at 1:19 PM, eks dev wrote:
> >
> > This should not be all that difficult to try. I accept it makes sense in
> > some cases ... but which ones?
> > Background: all my attempts to fight the OS went bad :(
> >
> > Let us think again about what the example Mike gave really means.
> >
> > You are explicitly deciding that Lucene should get a bigger share of RAM.
> > The OS will unload these pages if it needs Lucene's RAM for "something
> > else" and you are not using them. Right?
> >
> > If "something else" should get fewer resources, we are on target, but this
> > is the end result: in any shared setup where many things run, this decision
> > has its consequences, and "something else" is going to be starved.
> >
> > In the other case, where only Lucene runs, what is the difference whether
> > we evict unused pages or the OS does it (better control is the only benefit
> > we get)? This is the case where you are anyhow in a "not really comfortable
> > for real caching" situation, otherwise even greedy OSs wouldn't swap (at
> > least in my experience with reasonably configured OSs)...
> >
> > After thinking about it again, I would say yes, there are for sure some
> > cases where it helps, but not many, and even in these cases the benefit
> > will be small.
> >
> > I guess :)
> >
> >
> >
> >
> >
> >
> > ----- Original Message ----
> >> From: Michael McCandless 
> >> To: [email protected]
> >> Sent: Wednesday, 22 July, 2009 18:37:19
> >> Subject: Re: Java caching of low-level index data?
> >>
> >> I think it's a neat idea!
> >>
> >> But you are in fact fighting the OS so I'm not sure how well this'll
> >> work in practice.
> >>
> >> EG the OS will happily swap out pages from your process if it thinks
> >> you're not using them, so it'd easily swap out your cache in favor of
> >> its own IO cache (this is the "swappiness" configuration on Linux),
> >> which would then kill performance (take a page hit when you finally
> >> did need to use your cache).  In C (possibly requiring root) you could
> >> wire the pages, but we can't do that from javaland, so it's already
> >> not a fair fight.
> >>
> >> Mike
> >>
> >> On Wed, Jul 22, 2009 at 11:56 AM, eks dev wrote:
> >> > IMO, it is too low-level to do it better than the OS. I agree the cache
> >> > unloading effect would be prevented with it, but I am not sure it brings
> >> > a net benefit: you would get this problem fixed, but the OS would
> >> > probably kill you anyhow (you took valuable memory from the OS) on
> >> > queries that miss your internal cache...
> >> >
> >> > We could try to do better if we put more focus on higher levels and do
> >> > the caching there... maybe even somehow cache some CPU work, e.g. keep
> >> > dense Postings in a "faster, less compressed" format, load the
> >> > TermDictionary into a RAMDirectory and keep the rest on disk. Ideas in
> >> > that direction have a better chance to bring us forward. Take for
> >> > example FuzzyQuery: there you can do some LRU caching at the Term level
> >> > and save huge amounts of IO and CPU...
> >> >
> >> >
> >> >
> >> >
> >> > From: Shai Erera
> >> > To: [email protected]
> >> > Sent: Wednesday, 22 July, 2009 17:32:34
> >> > Subject: Re: Java caching of low-level index data?
> >> >
> >> > That's an interesting idea.
> >> >
> >> > I always wonder however how much exactly would we gain, vs. the effort
> >> > spent to develop, debug and maintain it. Just some thoughts that we
> >> > should consider regarding this:
> >> >
> >> > * For very large indices, which is where we think this will generally
> >> > help, I believe it's reasonable to assume that the search index will sit
> >> > on its own machine, or set of CPUs, RAM and HD. Therefore, given that
> >> > very little will run on the OS other than the search index, I assume the
> >> > OS cache will be enough (if not better)?
> >> >
> >> > * In other cases, where the search app runs together w/ other apps, I'm
> >> > not sure how much we'll gain. I can assume such apps will use a smaller
> >> > index, or will not need to support high query load? If so, will they
> >> > really care if we cache their data, vs. the OS?
> >> >
> >> > Like I said, these are just thoughts. I don't mean to cancel the idea w/
> >> > them, just to think how much will it improve performance (vs. maybe even
> >> > hurt it?). Often I find it that some optimizations that are done will
> >> > benefit very large indices. But these usually get their decent share of
> >> > resources, and the JVM itself is run w/ larger heap etc. So these
> >> > optimizations turn out to not affect such indices much after all. And for
> >> > smaller indices, performance is usually not a problem (well ... they
> >> > might just fit entirely in RAM).
> >> >
> >> > Shai
> >> >
> >> > On Wed, Jul 22, 2009 at 6:21 PM, Nigel wrote:
> >> >>
> >> >> In discussions of Lucene search performance, the importance of OS
> >> >> caching of index data is frequently mentioned.  The typical
> >> >> recommendation is to keep plenty of unallocated RAM available (e.g.
> >> >> don't gobble it all up with your JVM heap) and try to avoid large I/O
> >> >> operations that would purge the OS cache.
> >> >>
> >> >> I'm curious if anyone has thought about (or even tried) caching the
> >> >> low-level index data in Java, rather than in the OS.  For example, at
> >> >> the IndexInput level there could be an LRU cache of byte[] blocks,
> >> >> similar to how a RDBMS caches index pages.  (Conveniently,
> >> >> BufferedIndexInput already reads in 1k chunks.)  You would reverse the
> >> >> advice above and instead make your JVM heap as large as possible (or at
> >> >> least large enough to achieve a desired speed/space tradeoff).
> >> >>
> >> >> This approach seems like it would have some advantages:
> >> >>
> >> >> - Explicit control over how much you want cached (adjust your JVM heap
> >> >>   and cache settings as desired)
> >> >> - Cached index data won't be purged by the OS doing other things
> >> >> - Index warming might be faster, or at least more predictable
> >> >>
> >> >> The obvious disadvantage for some situations is that more RAM would now
> >> >> be tied up by the JVM, rather than managed dynamically by the OS.
> >> >>
> >> >> Any thoughts?  It seems like this would be pretty easy to implement
> >> >> (subclass FSDirectory, return subclass of FSIndexInput that checks the
> >> >> cache before reading, cache keyed on filename + position), but maybe I'm
> >> >> oversimplifying, and for that matter a similar implementation may
> >> >> already exist somewhere for all I know.
> >> >>
> >> >> Thanks,
> >> >> Chris
> >> >
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]




