Yes :) If I recall how that works, it's mildly similar to a few other things I've seen. Not super trivial to implement in a short period of time though.
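
For anyone curious what that kind of "visitor" pass boils down to, here's a minimal, self-contained toy sketch. The toy_item struct, toy_unlink_and_free() and the numbers are invented for illustration; this is not Facebook's plugin and not memcached's item code. It just visits a bounded number of items from the LRU tail and reclaims anything that has already expired:

/* gcc -o visitor_sketch visitor_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy stand-ins for memcached's item/LRU structures; names and layout
 * are invented for illustration only. */
typedef struct toy_item {
    struct toy_item *prev;   /* toward the LRU head (newer items) */
    struct toy_item *next;   /* toward the LRU tail (older items) */
    time_t exptime;          /* absolute expiry, 0 = never expires */
    int    refcount;         /* 1 = linked and idle in this toy model */
    char   key[32];
} toy_item;

static toy_item *lru_head, *lru_tail;

static void toy_unlink_and_free(toy_item *it) {
    if (it->prev) it->prev->next = it->next; else lru_head = it->next;
    if (it->next) it->next->prev = it->prev; else lru_tail = it->prev;
    printf("reclaimed expired key: %s\n", it->key);
    free(it);
}

/* One "visitor" pass: look at no more than `budget` items starting from
 * the tail and reclaim anything already expired.  A real version would
 * run in a background thread, hold the item/LRU locks, and sleep between
 * batches so it stays off the hot path. */
static void visitor_pass(int budget, time_t now) {
    toy_item *it = lru_tail;
    while (it != NULL && budget-- > 0) {
        toy_item *newer = it->prev;               /* save before freeing */
        if (it->exptime != 0 && it->exptime <= now && it->refcount == 1)
            toy_unlink_and_free(it);
        it = newer;
    }
}

static void toy_insert(const char *key, time_t exptime) {
    toy_item *it = calloc(1, sizeof(*it));
    snprintf(it->key, sizeof(it->key), "%s", key);
    it->exptime = exptime;
    it->refcount = 1;
    it->next = lru_head;                          /* push at the head */
    if (lru_head) lru_head->prev = it; else lru_tail = it;
    lru_head = it;
}

int main(void) {
    time_t now = time(NULL);
    toy_insert("stats-120s", now - 10);           /* already expired */
    toy_insert("ad-data-6h", now + 21600);        /* still live */
    visitor_pass(1000, now);                      /* frees only stats-120s */
    return 0;
}

A real version would run in its own background thread, take the proper item/LRU locks, and sleep between small batches so the scan stays low priority and barely shows up in request latency.
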
On Sat, 12 Apr 2014, Ryan McElroy wrote:

> Facebook implemented a "visitor" plugin system that we use to kick out already-expired items in our memcached instances. It runs at low priority and doesn't cause much latency that we notice. I should really get our version back out there so that others can see how we did it and implement it in the legit memcached :-)
>
> ~Ryan
>
> On Fri, Apr 11, 2014 at 11:08 AM, dormando <[email protected]> wrote:
>
> > s/pagging/padding/. gah.
> >
> > On Fri, 11 Apr 2014, dormando wrote:
> >
> > > On Fri, 11 Apr 2014, Slawomir Pryczek wrote:
> > >
> > > > Hi Dormando, more about the behaviour... when we're using "normal" memcached 1.4.13 16GB of memory gets exhausted in ~1h, then we start to have almost instant evictions of needed items (again these items aren't really "needed" individually, just when many of them gets evicted it's unacceptable because of how badly it affects the system)
> > >
> > > Almost instant evictions; so an item is stored, into a 16GB instance, and < 120 seconds later is bumped out of the LRU?
> > >
> > > You'll probably just ignore me again, but isn't this just slab imbalance? Once your instance fills up there're probably a few slab classes with way too little memory in them.
> > >
> > > 'stats slabs' shows you per-slab eviction rates, along with the last accessed time of an item when it was evicted. What does this look like on one of your full instances?
> > >
> > > The slab rebalance system lets you plug in your own algorithm by running the page reassignment commands manually. Then you can smooth out the pages to where you think they should be.
> > >
> > > You mention long and short TTL, but what are they exactly? 120s and an hour? A week?
> > >
> > > I understand your desire to hack up something to solve this, but as you've already seen scanning memory to remove expired items is problematic: you're either going to do long walks from the tail, use a background thread and walk a "probe" item through, or walk through random slab pages looking for expired memory. None of these are very efficient and tend to rely on luck.
> > >
> > > A better way to do this is to bucket the memory by TTL. You have lots of pretty decent options for this (and someone else already suggested one):
> > >
> > > - In your client, use different memcached pools for major TTL buckets (ie; one instance only gets long items, one only short). Make sure the slabs aren't imbalanced via the slab rebalancer.
> > >
> > > - Are the sizes of the items correlated with their TTL? Are 120s items always in a ~300 byte range and longer items tend to be in a different byte range? You could use length pagging to shunt them into specific slab classes, separating them internally at the cost of some ram efficiency.
> > >
> > > - A storage engine (god I wish we'd made 1.6 work...) which allows bucketing by TTL ranges. You'd want a smaller set of slab classes to not waste too much memory here, but the idea is the same as running multiple individual instances, except internally splitting the storage engine instead and storing everything in the same hash table.
> > >
> > > Those three options completely avoid latency problems, the first one requires no code modifications and will work very well. The third is the most work (and will be tricky due to things like slab rebalance, and none of the slab class identification code will work). I would avoid it unless I were really bored and wanted to maintain my own fork forever.
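
To make the first of those options concrete, here's a rough client-side sketch using libmemcached. The hosts, ports and the 300-second cutoff are placeholders; the real split should follow whatever TTL ranges your traffic actually uses:

/* gcc -o ttl_pools ttl_pools.c -lmemcached */
#include <string.h>
#include <libmemcached/memcached.h>

/* Two separate pools: one instance only ever sees short-TTL items, the
 * other only long-TTL items, so short-lived junk can never pin memory
 * that long-lived items want (or vice versa). */
static memcached_st *short_pool, *long_pool;

static void pools_init(void) {
    short_pool = memcached_create(NULL);
    long_pool  = memcached_create(NULL);
    memcached_server_add(short_pool, "127.0.0.1", 11211);  /* short-TTL box */
    memcached_server_add(long_pool,  "127.0.0.1", 11212);  /* long-TTL box  */
}

/* Route a write to the pool matching its TTL bucket.  The 300s cutoff is
 * an arbitrary placeholder. */
static memcached_return_t set_by_ttl(const char *key, const char *val,
                                     time_t ttl) {
    memcached_st *pool = (ttl <= 300) ? short_pool : long_pool;
    return memcached_set(pool, key, strlen(key), val, strlen(val),
                         ttl, (uint32_t)0);
}

int main(void) {
    pools_init();
    set_by_ttl("stats-slice-123", "42", 120);      /* goes to the short pool */
    set_by_ttl("ad-priority-foo", "{}", 21600);    /* goes to the long pool  */
    memcached_free(short_pool);
    memcached_free(long_pool);
    return 0;
}

Reads need a matching rule; in practice you route by key namespace (e.g. all the short-lived stats keys live on the short pool), since the TTL isn't known at get time. It's purely client configuration, nothing changes on the servers.
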
> > > > ~2 years ago i created another version based on that 1.4.13, that does garbage collection using a custom stats handler. That version is able to run on half of the memory for like 2 weeks, with 0 evictions. But we gave it the full 16G and just restart it each week to be sure memory usage is kept in check, and we're not throwing away good data. Actually after changing -f1.25 to -f1.041 the slabs are filling with bad items much slower, because items are distributed better and this custom eviction function is able to catch more expired data. We have like 200GB of data evicted this way, daily. Because of volume (~40k req/s peak, much of it are writes) and differences in expire time LRU isn't able to reclaim items efficiently.
> > > >
> > > > Maybe people don't even realize the problem, but when we did some testing and turned off that "custom" eviction we had like 100% memory used with 10% of waste reported by memcached admin. But when we ran that custom eviction algorithm it turned out that 90% of memory is occupied by garbage. Waste reported grew to 80% instantly after running unlimited "reclaim expired" on all items in the cache. So in "standard" client when people will be using different expire times for items (we have it like 1 minute minimum, 6h max)... they even won't be able to see how much memory they're wasting in some specific cases, when they'll have many items that won't be hit after expiration, like we have.
> > > >
> > > > When using memcached as a buffer for mysql writes, we know exactly what to hit and when. Short TTL expired items pile up near the head... long TTL "live" items pile up near the tail and it's creating a barrier that prevents the LRU algo from reclaiming almost anything, if im getting how it currently works correctly...
> > > >
> > > > > You made it sound like you had some data which never expired? Is this true?
> > > >
> > > > Yes, i think because of how evictions are made (to be clear we're not setting non-expiring items). These short expiring items pile up in the front of linked list, something that is supposed to live for eg. 120 or 180 seconds is lingering in memory forever, until we restart the cache... and new items are killed almost instantly because there are no expired items on head.
> > > >
> > > > It's a special case, because after processing memory list, aggregating data and putting it in mysql these items are no longer touched. The list for new time period will have completely different set of keys. As we use a prefix to generate all items in the list:
> > > >
> > > > $time_slice = floor( self::$time / 60) - $time_slices_back;
> > > > $prefix = ")ML){$list_id}-{$time_slice}";
> > > >
> > > > Again, not saying current implementation is bad... because it's fast and doesn't trash CPU cache when expire times are ~equal, that was probably the idea... but we have an atypical use case, which LRU isn't able to manage...
> > > >
> > > > Now im making ~same changes i made for .13... but for .17 and i want to make it work a little better ;)
> > > >
> > > > On Friday, 11 April 2014 05:12:10 UTC+2, Dormando wrote:
> > > >
> > > > > > Hey Dormando, thanks again for some comments... appreciate the help.
> > > > > >
> > > > > > Maybe i wasn't clear enough.
> > > > > > I need only 1 minute persistence, and i can lose data sometimes, just i can't keep losing data every minute due to constant evictions caused by LRU. Actually i have just written that in my previous post. We're losing about 1 minute of non-meaningful data every week because of the restart that we do when memory starts to fill up (even with our patch reclaiming using linked list, we limit reclaiming to keep speed better)... so the memory fills up after a week, not 30 minutes...
> > > > >
> > > > > Can you explain what you're seeing in more detail? Your data only needs to persist for 1 minute, but it's being evicted before 1 minute is up?
> > > > >
> > > > > You made it sound like you had some data which never expired? Is this true?
> > > > >
> > > > > If your instance is 16GB, takes a week to fill up, but data only needs to persist for a minute but isn't, something else is very broken? Or am I still misunderstanding you?
> > > > >
> > > > > > Now im creating a better solution, to limit locking as the linked list is getting bigger.
> > > > > >
> > > > > > I explained what the worst implications of unwanted evictions (or losing all data in cache) are in my use case:
> > > > > > 1. losing ~1 minute of non-significant data that's about to be stored in sql
> > > > > > 2. "flat" distribution of load to workers (not taking response times into account because stats reset).
> > > > > > 3. resorting to alternative targeting algorithm (with global, not local statistics).
> > > > > >
> > > > > > I never, ever said im going to write data that have to be persistent permanently. It's actually the same idea as delayed write. If power fails you lose 5s of data, but you can do 100x more writes. So you need the data to be persistent in memory, between writes the data **can't be lost**. However you can lose it sometimes, that's the tradeoff that some people can make and some not. Obviously I can't keep losing this data each minute, because if i lose much it'll become meaningful.
> > > > > >
> > > > > > Maybe i wasn't clear in that matter. I can lose all data even 20 times a day. Sensitive data is stored using bulk update or transactions, bypassing that "delayed write" layer. "0 evictions", that's the kind of "persistence" im going for. So items are persistent for some very short periods of time (1-5 minutes) without being killed. It's just a different use case. Running in production since 2 years, based on 1.4.13, tested for correctness, monitored so we have enough memory and 0 evictions (just reclaims).
> > > > > >
> > > > > > When i came here with the same idea ~2 years ago you just said it's very stupid, now you even made me look like a moron :) And i can understand why you don't want features that are not ~O(1) perfectly, but please don't get so personal about different ideas to do things and use cases, just because these won't work for you.
> > > > > >
> > > > > > On Thursday, 10 April 2014 20:53:12 UTC+2, Dormando wrote:
> > > > > >
> > > > > > > You really really really really really *must* not put data in memcached which you can't lose.
> > > > > > >
> > > > > > > Seriously, really don't do it. If you need persistence, try using a redis instance for the persistent stuff, and use memcached for your cache stuff.
> > > > > > > I don't see why you feel like you need to write your own thing, there're a lot of persistent key/value stores (kyotocabinet/etc?). They have a much lower request ceiling and don't handle the LRU/cache pattern as well, but that's why you can use both.
> > > > > > >
> > > > > > > Again, please please don't do it. You are damaging your company. You are a *danger* to your company.
> > > > > > >
> > > > > > > On Thu, 10 Apr 2014, Slawomir Pryczek wrote:
> > > > > > >
> > > > > > > > Hi Dormando, thanks for suggestions, a background thread would be nice...
> > > > > > > >
> > > > > > > > The idea is actually that with 2-3GB i get plenty of evictions of items that need to be fetched later. And with 16GB i still get evictions, actually probably i could throw more memory than 16G at it and it'd only result in more expired items sitting in the middle of slabs, forever... Now im going for persistence. Sounds probably crazy, but we're having some data that we can't lose:
> > > > > > > >
> > > > > > > > 1. statistics, we aggregate writes to DB using memcached (+list implementation). If these items get evicted we're losing rows in db. Losing data sometimes isn't a big problem. Eg. we restart memcached once a week so we're losing 1 minute of data every week. But if we have evictions we're losing data constantly (which we can't have)
> > > > > > > > 2. we drive load balancer using data in memcached for statistics, again, not nice to lose data often because workers can get incorrect amount of traffic.
> > > > > > > > 3. we're doing some adserving optimizations, eg. counting per-domain ad priority, for one domain it takes about 10 seconds to analyze all data and create list of ads, so can't be done online... we put result of this in memcached, if we lose too much of this the system will start to serve suboptimal ads (because it'll need to switch to more general data or much simpler algorithm that can be done instantly)
> > > > > > > >
> > > > > > > > Probably would be best to rewrite all this using C or golang, and use memcached just for caching, but it'd take too much time which we don't have currently...
> > > > > > > >
> > > > > > > > I have seen twitter and nk implementations that seem to do what i need, but they seem old (based on old code), so I prefer to modify code of recent "official" memcached, to not be stuck with old code or abandonware. Actually there are many topics about limitations of the current eviction algo, and an option to enable some background thread to do scraping based on statistics of most filled slabs (with some parameter to specify if it should take a light or aggressive approach) would be nice...
> > > > > > > >
> > > > > > > > As for the code... is that the slab_rebalance_move function in slab.c? It seems a little difficult to grasp without some DOCs of how things are working... can you please write a very short description of how this "angry birds" mode works?
> > > > > > >
> > > > > > > Look at doc/protocol.txt for explanations of the slab move options. The names are greppable back to the source.
> > > > > > >
> > > > > > > > I have a quick question about this above... linked is item that's placed on linked list, but what do the other flags mean, and why are the last 2 of them temporary?
> > > > > > > > #define ITEM_LINKED 1
> > > > > > > > #define ITEM_CAS 2
> > > > > > > >
> > > > > > > > /* temp */
> > > > > > > > #define ITEM_SLABBED 4
> > > > > > > > #define ITEM_FETCHED 8
> > > > > > > >
> > > > > > > > This from slab_rebalance_move seems interesting:
> > > > > > > >
> > > > > > > > refcount = refcount_incr(&it->refcount);
> > > > > > > > ...
> > > > > > > > if (refcount == 1) { /* item is unlinked, unused */
> > > > > > > > ...
> > > > > > > > } else if (refcount == 2) { /* item is linked but not busy */
> > > > > > > >
> > > > > > > > Are there some docs about refcounts, locks and item states? Basically, why is an item with refcount 2 not busy? You're increasing refcount by 1 on select, then again when reading data? Can refcount ever be higher than 2 (3 in above case), meaning 2 threads can access the same item?
> > > > > > >
> > > > > > > The comment on the same line is explaining exactly what it means.
> > > > > > >
> > > > > > > Unfortunately it's a bit of a crap shoot. I think I wrote a threads explanation somewhere (some release notes, or in a file in there, I can't quite remember offhand). Since scaling the thread code it got a lot more complicated. You have to be extremely careful under what circumstances you access items (you must hold an item lock + the refcount must be 2 if you want to unlink it).
> > > > > > >
> > > > > > > You'll just have to study it a bit, sorry. Grep around to see where the flags are used.
> > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > On Thursday, 10 April 2014 06:05:30 UTC+2, Dormando wrote:
> > > > > > > >
> > > > > > > > > > Hi Guys,
> > > > > > > > > > im running a specific case where i don't want (actually can't have) to have evicted items (evictions = 0 ideally)... now i have created some simple algo that locks the cache, goes through the linked list and evicts items... it makes some problems, like 10-20ms cache locks in some cases.
> > > > > > > > > >
> > > > > > > > > > Now im thinking about going through each slab memory (slabs keep a list of allocated memory regions)... looking for items, if an expired item is found, evict it... this way i can go eg. 10k items or 1MB of memory at a time + pick slabs with high utilization and run this "additional" eviction only on them... so it'll prevent allocating memory just because unneeded data with short TTL is occupying the HEAD of the list.
> > > > > > > > > >
> > > > > > > > > > With this linked list eviction im able to run on 2-3GB of memory... without it 16GB of memory is exhausted in 1-2h and then memcached starts to kill "good" items (leaving expired ones wasting memory)...
> > > > > > > > > >
> > > > > > > > > > Any comments?
> > > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > you're going a bit against the base algorithm. if stuff is falling out of 16GB of memory without ever being utilized again, why is that critical? Sounds like you're optimizing the numbers instead of actually tuning anything useful.
> > > > > > > > >
> > > > > > > > > That said, you can probably just extend the slab rebalance code. There's a hook in there (which I called "Angry birds mode") that drives a slab rebalance when it'd otherwise run an eviction. That code already safely walks the slab page for unlocked memory and frees it; you could edit it slightly to check for expiration and then freelist it into the slab class instead.
> > > > > > > > >
> > > > > > > > > Since it's already a background thread you could further modify it to just wake up and walk pages for stuff to evict.
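
To show roughly what "check for expiration and then freelist it into the slab class" could look like, here's a tiny self-contained toy model. The struct names, fields and chunk count are invented; memcached's real page walk, locking and freelists (slabs.c / items.c) are considerably more involved:

/* gcc -o page_walk_sketch page_walk_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHUNK_COUNT 8   /* chunks per toy "page" */

/* Toy model of one slab page carved into fixed-size chunks.  Field names
 * and sizes are invented; memcached's real item header, locking and
 * freelist live in items.c / slabs.c. */
typedef struct toy_chunk {
    int    occupied;             /* 0 = already on the free list */
    int    refcount;             /* 1 = linked, nobody else holding it */
    time_t exptime;              /* absolute expiry, 0 = never */
    struct toy_chunk *free_next; /* freelist link when not occupied */
} toy_chunk;

typedef struct {
    toy_chunk chunks[CHUNK_COUNT];   /* "the page" */
    toy_chunk *freelist;             /* per-class free list */
} toy_slab_class;

/* The idea from the thread: while walking a page chunk by chunk (as the
 * slab-rebalance path already does for unlocked memory), also test
 * exptime and hand expired chunks straight back to the class freelist
 * instead of waiting for them to reach the LRU tail. */
static int reclaim_expired_in_page(toy_slab_class *cls, time_t now) {
    int reclaimed = 0;
    for (int i = 0; i < CHUNK_COUNT; i++) {
        toy_chunk *c = &cls->chunks[i];
        if (c->occupied && c->refcount == 1 &&
            c->exptime != 0 && c->exptime <= now) {
            c->occupied  = 0;            /* "unlink" the dead item */
            c->free_next = cls->freelist;
            cls->freelist = c;           /* freelist it into the class */
            reclaimed++;
        }
    }
    return reclaimed;
}

int main(void) {
    toy_slab_class cls = {0};
    time_t now = time(NULL);
    for (int i = 0; i < CHUNK_COUNT; i++) {
        cls.chunks[i].occupied = 1;
        cls.chunks[i].refcount = 1;
        /* odd chunks expired a minute ago, even chunks live for an hour */
        cls.chunks[i].exptime = (i % 2) ? now - 60 : now + 3600;
    }
    printf("reclaimed %d expired chunks\n", reclaim_expired_in_page(&cls, now));
    return 0;
}

Doing this during a page walk, rather than only from the LRU tail, is what lets expired short-TTL items go back on the freelist even while live long-TTL items sit at the tail in front of them.
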
--

---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
