On Fri, 11 Apr 2014, dormando wrote:
>
>
> On Fri, 11 Apr 2014, Slawomir Pryczek wrote:
>
> > Hi Dormando, more about the behaviour... when we're using "normal"
> > memcached 1.4.13, 16GB of memory gets exhausted in ~1h, then we start to
> > have almost instant evictions of needed items (again, these items aren't
> > really "needed" individually; it's just that when many of them get evicted
> > it's unacceptable because of how badly it affects the system)
>
> Almost instant evictions; so an item is stored, into a 16GB instance, and
> < 120 seconds later is bumped out of the LRU?
>
> You'll probably just ignore me again, but isn't this just slab imbalance?
> Once your instance fills up there're probably a few slab classes with way
> too little memory in them.
>
> 'stats slabs' and 'stats items' show you per-slab-class eviction rates,
> along with the last-accessed age of an item when it was evicted. What does
> this look like on one of your full instances?
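>
> For reference, the fields to look at are roughly (names as of 1.4.x):
>
>   stats slabs  -> chunk_size, total_pages, used_chunks per slab class
>   stats items  -> items:<class>:evicted, evicted_nonzero and
>                   evicted_time (seconds since the evicted item's last
>                   access)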
>
> The slab rebalance system lets you plug in your own algorithm by running
> the page reassignment commands manually. Then you can smooth out the pages
> to where you think they should be.
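>
> Roughly (doc/protocol.txt has the exact syntax):
>
>   slabs automove 0             <- take manual control from the automover
>   slabs reassign <src> <dst>   <- move one page from class src to class dst
>
> e.g. "slabs reassign 3 10" pulls a page out of class 3 and hands it to
> class 10.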
>
> You mention long and short TTL, but what are they exactly? 120s and an
> hour? A week?
>
> I understand your desire to hack up something to solve this, but as you've
> already seen, scanning memory to remove expired items is problematic:
> you're either going to do long walks from the tail, use a background
> thread and walk a "probe" item through, or walk through random slab pages
> looking for expired memory. None of these are very efficient and they tend
> to rely on luck.
>
> A better way to do this is to bucket the memory by TTL. You have lots of
> pretty decent options for this (and someone else already suggested one):
>
> - In your client, use different memcached pools for major TTL buckets (i.e.,
> one instance only gets long items, another only short; see the first sketch
> after this list). Make sure the slabs aren't imbalanced via the slab
> rebalancer.
>
> - Are the sizes of the items correlated with their TTL? Are 120s items
> always in a ~300-byte range while longer-lived items tend to fall in a
> different byte range? You could use length padding to shunt them into
> specific slab classes, separating them internally at the cost of some RAM
> efficiency (see the second sketch below).
>
> - A storage engine (god I wish we'd made 1.6 work...) which allows
> bucketing by TTL ranges. You'd want a smaller set of slab classes so as
> not to waste too much memory here, but the idea is the same as running
> multiple individual instances, except splitting the storage internally
> instead and storing everything in the same hash table.
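>
> For the first option, the client-side routing can be tiny. Here's a
> minimal sketch using libmemcached (the hostnames and the 300s cutoff are
> invented for illustration):
>
>   #include <string.h>
>   #include <time.h>
>   #include <libmemcached/memcached.h>
>
>   /* Two independent instances: one only ever stores short-TTL items,
>    * the other only long-TTL ones, so neither LRU sees a mix. */
>   static memcached_st *short_pool, *long_pool;
>
>   static void pools_init(void) {
>       short_pool = memcached_create(NULL);
>       long_pool  = memcached_create(NULL);
>       memcached_server_add(short_pool, "cache-short.example.com", 11211);
>       memcached_server_add(long_pool, "cache-long.example.com", 11211);
>   }
>
>   /* Route by TTL; the 300s boundary is arbitrary, pick whatever splits
>    * your workload. Reads must apply the same rule. */
>   static memcached_return_t set_bucketed(const char *key, const char *val,
>                                          size_t vlen, time_t ttl) {
>       memcached_st *mc = (ttl <= 300) ? short_pool : long_pool;
>       return memcached_set(mc, key, strlen(key), val, vlen, ttl, 0);
>   }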
>
> Those three options completely avoid latency problems; the first one
> requires no memcached modifications and will work very well. The third is
> the most work (and will be tricky due to things like slab rebalance, and
> none of the slab class identification code will work). I would avoid it
> unless I were really bored and wanted to maintain my own fork forever.
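>
> For the second (padding) option, the client-side change is equally small.
> A minimal sketch; the 304-byte floor is a made-up example, the real class
> boundaries come from the chunk_size values 'stats slabs' reports for your
> -f factor:
>
>   #include <stddef.h>
>   #include <string.h>
>
>   /* Pad short-TTL values up to a fixed floor so they always land in one
>    * slab class, away from the classes the long-TTL items occupy. */
>   #define SHORT_TTL_FLOOR 304   /* example only; derive from chunk_size */
>
>   static size_t pad_short_value(char *buf, size_t len, size_t cap) {
>       if (len >= SHORT_TTL_FLOOR || cap < SHORT_TTL_FLOOR)
>           return len;             /* big enough already, or no room */
>       memset(buf + len, ' ', SHORT_TTL_FLOOR - len);
>       return SHORT_TTL_FLOOR;     /* store with the padded length */
>   }
>
> The reader then has to strip the trailing padding (or encode the real
> length in the value itself); that's the RAM-efficiency cost mentioned
> above.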
>
> > ~2 years ago I created another version based on that 1.4.13, which does
> > garbage collection using a custom stats handler. That version is able to
> > run on half of the memory for ~2 weeks, with 0 evictions. But we gave it
> > the full 16G and just restart it each week to be sure memory usage is kept
> > in check and we're not throwing away good data. Actually, after changing
> > -f1.25 to -f1.041 the slabs fill with bad items much more slowly, because
> > items are distributed better and this custom eviction function is able to
> > catch more expired data. We have something like 200GB of data evicted this
> > way, daily. Because of volume (~40k req/s peak, much of it writes) and
> > differences in expire time, the LRU isn't able to reclaim items efficiently.
> >
> > Maybe people don't even realize the problem, but when we did some testing
> > and turned off that "custom" eviction, we had ~100% memory used with
> > 10% waste reported by the memcached admin. But when we ran the custom
> > eviction algorithm, it turned out that 90% of memory was occupied by
> > garbage. Reported waste grew to 80% instantly after running an unlimited
> > "reclaim expired" pass on all items in the cache. So with the "standard"
> > build, when people use different expire times for items (ours range from
> > a 1-minute minimum to a 6h maximum)... they won't even be able to see how
> > much memory they're wasting in some specific cases, when they have many
> > items that are never hit after expiration, like we do.
> >
> > When using memcached as a buffer for MySQL writes, we know exactly what
> > to hit and when. Short-TTL expired items pile up near the head... long-TTL
> > "live" items pile up near the tail, and that creates a barrier which
> > prevents the LRU algorithm from reclaiming almost anything, if I'm
> > understanding how it currently works correctly...
> >
> > > You made it sound like you had some data which never expired? Is this
> > > true?
> > Yes, I think it's because of how evictions are made (to be clear, we're
> > not setting non-expiring items). These short-expiring items pile up at
> > the front of the linked list; something that is supposed to live for
> > e.g. 120 or 180 seconds lingers in memory forever, until we restart the
> > cache... and new items are killed almost instantly because there are no
> > expired items at the head.
> >
> > It's a special case: after processing the in-memory list, aggregating
> > the data and putting it in MySQL, these items are no longer touched. The
> > list for the new time period will have a completely different set of
> > keys, as we use a prefix to generate all items in the list:
> >
> > $time_slice = floor(self::$time / 60) - $time_slices_back;
> > $prefix = ")ML){$list_id}-{$time_slice}";
> >
> > Again, I'm not saying the current implementation is bad... it's fast and
> > doesn't thrash the CPU cache when expire times are ~equal, which was
> > probably the idea... but we have an atypical use case, which the LRU
> > isn't able to manage...
> >
> > Now I'm making ~the same changes I made for .13, but for .17, and I want
> > to make them work a little better ;)
> >
> > On Friday, 11 April 2014 at 05:12:10 UTC+2, Dormando wrote:
> >
> > > Hey Dormando, thanks again for some comments... appreciate the help.
> > >
> > > Maybe I wasn't clear enough. I need only 1 minute of persistence, and
> > > I can lose data sometimes; I just can't keep losing data every minute
> > > due to constant evictions caused by the LRU. Actually, I just wrote
> > > that in my previous post. We're losing about 1 minute of
> > > non-meaningful data every week because of the restart we do when
> > > memory starts to fill up (even with our patch reclaiming via the
> > > linked list, we limit reclaiming to keep speed up)... so the memory
> > > fills up after a week, not 30 minutes...
> >
> > Can you explain what you're seeing in more detail? Your data only needs
> > to persist for 1 minute, but it's being evicted before 1 minute is up?
> >
> > You made it sound like you had some data which never expired? Is this
> > true?
> >
> > If your instance is 16GB and takes a week to fill up, but data that only
> > needs to persist for a minute isn't surviving, something else is very
> > broken? Or am I still misunderstanding you?
> >
> > > Now I'm creating a better solution, to limit locking as the linked
> > > list gets bigger.
> > >
> > > I explained the worst implications of unwanted evictions (or losing
> > > all data in the cache) for my use case:
> > > 1. losing ~1 minute of non-significant data that's about to be stored
> > > in SQL
> > > 2. "flat" distribution of load to workers (not taking response times
> > > into account because of the stats reset).
> > > 3. resorting to an alternative targeting algorithm (with global, not
> > > local, statistics).
> > >
> > > I never, ever said I'm going to write data that has to be persistent
> > > permanently. It's actually the same idea as delayed writes: if power
> > > fails you lose 5s of data, but you can do 100x more writes. So you
> > > need the data to be persistent in memory; between writes the data
> > > **can't be lost**. However you can lose it sometimes; that's a
> > > tradeoff some people can make and some can't. Obviously I can't keep
> > > losing this data every minute, because if I lose too much it becomes
> > > meaningful.
> > >
> > > Maybe I wasn't clear on that matter. I can lose all data even 20
> > > times a day. Sensitive data is stored using bulk updates or
> > > transactions, bypassing that "delayed write" layer. "0 evictions" is
> > > the kind of "persistence" I'm going for. So items are persistent for
> > > some very short periods of time (1-5 minutes) without being killed.
> > > It's just a different use case. It's been running in production for 2
> > > years, based on 1.4.13, tested for correctness, and monitored so we
> > > have enough memory and 0 evictions (just reclaims).
> > >
> > > When I came here with the same idea ~2 years ago you just said it's
> > > very stupid; now you've even made me look like a moron :) And I can
> > > understand why you don't want features that aren't perfectly ~O(1),
> > > but please don't get so personal about different ideas for doing
> > > things and different use cases, just because these won't work for you.
> > >
> > > On Thursday, 10 April 2014 at 20:53:12 UTC+2, Dormando wrote:
> > > You really really really really really *must* not put data in
> > > memcached which you can't lose.
> > >
> > > Seriously, really don't do it. If you need persistence, try using a
> > > redis instance for the persistent stuff, and use memcached for your
> > > cache stuff. I don't see why you feel like you need to write your own
> > > thing; there're a lot of persistent key/value stores
> > > (kyotocabinet/etc?). They have a much lower request ceiling and don't
> > > handle the LRU/cache pattern as well, but that's why you can use both.
> > >
> > > Again, please please don't do it. You are damaging your company. You
> > > are a *danger* to your company.
> > >
> > > On Thu, 10 Apr 2014, Slawomir Pryczek wrote:
> > >
> > > > Hi Dormando, thanks for the suggestions; a background thread would
> > > > be nice... The idea is actually that with 2-3GB I get plenty of
> > > > evictions of items that need to be fetched later. And with 16GB I
> > > > still get evictions; actually I could probably throw more memory
> > > > than 16G at it and it'd only result in more expired items sitting
> > > > in the middle of slabs, forever... Now I'm going for persistence.
> > > > Sounds crazy, probably, but we have some data that we can't lose:
> > > > 1. statistics: we aggregate writes to the DB using memcached (+ a
> > > > list implementation). If these items get evicted we're losing rows
> > > > in the DB. Losing data sometimes isn't a big problem, e.g. we
> > > > restart memcached once a week so we lose 1 minute of data every
> > > > week. But if we have evictions we're losing data constantly (which
> > > > we can't have).
> > > > 2. we drive a load balancer using data in memcached for statistics;
> > > > again, not nice to lose data often because workers can get an
> > > > incorrect amount of traffic.
> > > > 3. we're doing some ad-serving optimizations, e.g. counting
> > > > per-domain ad priority; for one domain it takes about 10 seconds to
> > > > analyze all the data and create the list of ads, so it can't be
> > > > done online... we put the result of this in memcached, and if we
> > > > lose too much of it the system will start to serve suboptimal ads
> > > > (because it'll need to switch to more general data or a much
> > > > simpler algorithm that can run instantly).
> > > >
> > > > Probably it would be best to rewrite all this in C or golang, and
> > > > use memcached just for caching, but that would take too much time,
> > > > which we don't have currently...
> > > >
> > > > I have seen the twitter and nk implementations that seem to do what
> > > > I need, but they seem old (based on old code), so I prefer to
> > > > modify the code of recent "official" memcached, to not be stuck
> > > > with old code or abandonware. Actually there are many topics about
> > > > the limitations of the current eviction algo, and an option to
> > > > enable some background thread to do scraping based on statistics of
> > > > the most-filled slabs (with some parameter to specify whether it
> > > > should take a light or aggressive approach) would be nice...
> > > >
> > > > As for the code... is that the slab_rebalance_move function in
> > > > slabs.c? It seems a little difficult to grasp without some docs on
> > > > how things work... can you please write a very short description of
> > > > how this "angry birds" mode works?
> > >
> > > Look at doc/protocol.txt for explanations of the slab move options.
> > > The names are greppable back to the source.
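> > >
> > > If memory serves, the block you want there covers:
> > >
> > >   slabs reassign <source class> <dest class>
> > >   slabs automove <0|1|2>
> > >
> > > where automove 2 is the aggressive, eviction-triggered mode described
> > > further down.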
> > >
> > > > I have a quick question about these flags... "linked" is an item
> > > > that's placed on the linked list, but what do the other flags mean,
> > > > and why are the last two of them marked temporary?
> > > > #define ITEM_LINKED 1
> > > > #define ITEM_CAS 2
> > > >
> > > > /* temp */
> > > > #define ITEM_SLABBED 4
> > > > #define ITEM_FETCHED 8
> > > >
> > > > This from slab_rebalance_move seems interesting:
> > > > refcount = refcount_incr(&it->refcount);
> > > > ...
> > > > if (refcount == 1) { /* item is unlinked, unused */
> > > > ...
> > > > } else if (refcount == 2) { /* item is linked but not busy */
> > > >
> > > > Is there some doc about refcounts, locks and item states?
> > > > Basically, why is an item with refcount 2 not busy? You're
> > > > increasing the refcount by 1 on select, then again when reading
> > > > data? Can the refcount ever be higher than 2 (3 in the above case),
> > > > meaning 2 threads can access the same item?
> > >
> > > The comment on the same line is explaining exactly what it means.
> > >
> > > Unfortunately it's a bit of a crap shoot. I think I wrote a threads
> > > explanation somewhere (some release notes, or in a file in there, I
> > > can't quite remember offhand). Since scaling the thread code it got a
> > > lot more complicated. You have to be extremely careful under what
> > > circumstances you access items (you must hold an item lock + the
> > > refcount must be 2 if you want to unlink it).
> > >
> > > You'll just have to study it a bit, sorry. Grep around to see where
> > > the flags are used.
> > >
> > > > Thanks.
> > > >
> > > > On Thursday, 10 April 2014 at 06:05:30 UTC+2, Dormando wrote:
> > > > > Hi Guys,
> > > > > I'm running a specific case where I don't want (actually can't
> > > > > have) evicted items (evictions = 0, ideally)... now I have
> > > > > created a simple algo that locks the cache, goes through the
> > > > > linked list and evicts items... it causes some problems, like
> > > > > 10-20ms cache locks in some cases.
> > > > >
> > > > > Now I'm thinking about going through each slab's memory (slabs
> > > > > keep a list of allocated memory regions)... looking for items; if
> > > > > an expired item is found, evict it... this way I can go e.g. 10k
> > > > > items or 1MB of memory at a time + pick slabs with high
> > > > > utilization and run this "additional" eviction only on them... so
> > > > > it'll prevent allocating memory just because unneeded data with a
> > > > > short TTL is occupying the HEAD of the list.
> > > > >
> > > > > With this linked list eviction I'm able to run on 2-3GB of
> > > > > memory... without it, 16GB of memory is exhausted in 1-2h and
> > > > > then memcached starts to kill "good" items (leaving expired ones
> > > > > wasting memory)...
> > > > >
> > > > > Any comments?
> > > > > Thanks.
> > > >
> > > > You're going a bit against the base algorithm. If stuff is falling
> > > > out of 16GB of memory without ever being utilized again, why is
> > > > that critical? Sounds like you're optimizing the numbers instead of
> > > > actually tuning anything useful.
> > > >
> > > > That said, you can probably just extend the slab rebalance code.
> > > > There's a hook in there (which I called "Angry birds mode") that
> > > > drives a slab rebalance when it'd otherwise run an eviction. That
> > > > code already safely walks the slab page for unlocked memory and
> > > > frees it; you could edit it slightly to check for expiration and
> > > > then freelist it into the slab class instead.
> > > >
> > > > Since it's already a background thread you could further modify it
> > > > to just wake up and walk pages for stuff to evict.
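> > > >
> > > > A minimal sketch of the expiry check itself, in the style of the
> > > > refcount == 2 branch of slab_rebalance_move() (names as in the
> > > > 1.4.x tree, not verbatim source; the item-lock and refcount
> > > > bookkeeping around it are the hard part and are only noted in
> > > > comments):
> > > >
> > > >   /* 'it' points at the chunk being walked; assume the caller holds
> > > >    * the item lock for its hash bucket and refcount_incr() returned
> > > >    * 2, i.e. the item is linked and not in use by another thread. */
> > > >   if ((it->it_flags & ITEM_LINKED) != 0 &&
> > > >       it->exptime != 0 && it->exptime <= current_time) {
> > > >       /* Expired: unlink from the hash table + LRU so the chunk
> > > >        * ends up back on its own class's freelist instead of the
> > > >        * page being handed to another class. hv is the key's hash
> > > >        * value, computed as elsewhere in slabs.c. */
> > > >       do_item_unlink(it, hv);
> > > >   }
> > > >   do_item_remove(it);  /* drop the reference taken above; at zero
> > > >                         * the item memory is freed */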
> > > >