Yes :) If I recall how that works, it's mildly similar to a few other things I've seen. Not super trivial to implement in a short period of time though.
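
For anyone curious what that kind of "visitor" pass boils down to, here's a minimal, self-contained toy sketch. The toy_item struct, toy_unlink_and_free() and the numbers are invented for illustration; this is not Facebook's plugin and not memcached's item code. It just visits a bounded number of items from the LRU tail and reclaims anything that has already expired:

/* gcc -o visitor_sketch visitor_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy stand-ins for memcached's item/LRU structures; names and layout
 * are invented for illustration only. */
typedef struct toy_item {
    struct toy_item *prev;   /* toward the LRU head (newer items) */
    struct toy_item *next;   /* toward the LRU tail (older items) */
    time_t exptime;          /* absolute expiry, 0 = never expires */
    int    refcount;         /* 1 = linked and idle in this toy model */
    char   key[32];
} toy_item;

static toy_item *lru_head, *lru_tail;

static void toy_unlink_and_free(toy_item *it) {
    if (it->prev) it->prev->next = it->next; else lru_head = it->next;
    if (it->next) it->next->prev = it->prev; else lru_tail = it->prev;
    printf("reclaimed expired key: %s\n", it->key);
    free(it);
}

/* One "visitor" pass: look at no more than `budget` items starting from
 * the tail and reclaim anything already expired.  A real version would
 * run in a background thread, hold the item/LRU locks, and sleep between
 * batches so it stays off the hot path. */
static void visitor_pass(int budget, time_t now) {
    toy_item *it = lru_tail;
    while (it != NULL && budget-- > 0) {
        toy_item *newer = it->prev;               /* save before freeing */
        if (it->exptime != 0 && it->exptime <= now && it->refcount == 1)
            toy_unlink_and_free(it);
        it = newer;
    }
}

static void toy_insert(const char *key, time_t exptime) {
    toy_item *it = calloc(1, sizeof(*it));
    snprintf(it->key, sizeof(it->key), "%s", key);
    it->exptime = exptime;
    it->refcount = 1;
    it->next = lru_head;                          /* push at the head */
    if (lru_head) lru_head->prev = it; else lru_tail = it;
    lru_head = it;
}

int main(void) {
    time_t now = time(NULL);
    toy_insert("stats-120s", now - 10);           /* already expired */
    toy_insert("ad-data-6h", now + 21600);        /* still live */
    visitor_pass(1000, now);                      /* frees only stats-120s */
    return 0;
}

A real version would run in its own background thread, take the proper item/LRU locks, and sleep between small batches so the scan stays low priority and barely shows up in request latency.
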
On Sat, 12 Apr 2014, Ryan McElroy wrote:

> Facebook implemented a "visitor" plugin system that we use to kick out already-expired items in our memcached instances. It runs at low priority and doesn't cause much latency that we notice. I should really get our version back out there so that others can see how we did it and implement it in the legit memcached :-)
>
> ~Ryan
>
> On Fri, Apr 11, 2014 at 11:08 AM, dormando <[email protected]> wrote:
>
> > s/pagging/padding/. gah.
> >
> > On Fri, 11 Apr 2014, dormando wrote:
> >
> > > On Fri, 11 Apr 2014, Slawomir Pryczek wrote:
> > >
> > > > Hi Dormando, more about the behaviour... when we're using "normal" memcached 1.4.13 16GB of memory gets exhausted in ~1h, then we start to have almost instant evictions of needed items (again these items aren't really "needed" individually, just when many of them gets evicted it's unacceptable because of how badly it affects the system)
> > >
> > > Almost instant evictions; so an item is stored, into a 16GB instance, and < 120 seconds later is bumped out of the LRU?
> > >
> > > You'll probably just ignore me again, but isn't this just slab imbalance? Once your instance fills up there're probably a few slab classes with way too little memory in them.
> > >
> > > 'stats slabs' shows you per-slab eviction rates, along with the last accessed time of an item when it was evicted. What does this look like on one of your full instances?
> > >
> > > The slab rebalance system lets you plug in your own algorithm by running the page reassignment commands manually. Then you can smooth out the pages to where you think they should be.
> > >
> > > You mention long and short TTL, but what are they exactly? 120s and an hour? A week?
> > >
> > > I understand your desire to hack up something to solve this, but as you've already seen scanning memory to remove expired items is problematic: you're either going to do long walks from the tail, use a background thread and walk a "probe" item through, or walk through random slab pages looking for expired memory. None of these are very efficient and tend to rely on luck.
> > >
> > > A better way to do this is to bucket the memory by TTL. You have lots of pretty decent options for this (and someone else already suggested one):
> > >
> > > - In your client, use different memcached pools for major TTL buckets (ie; one instance only gets long items, one only short). Make sure the slabs aren't imbalanced via the slab rebalancer.
> > >
> > > - Are the sizes of the items correlated with their TTL? Are 120s items always in a ~300 byte range and longer items tend to be in a different byte range? You could use length pagging to shunt them into specific slab classes, separating them internally at the cost of some ram efficiency.
> > >
> > > - A storage engine (god I wish we'd made 1.6 work...) which allows bucketing by TTL ranges. You'd want a smaller set of slab classes to not waste too much memory here, but the idea is the same as running multiple individual instances, except internally splitting the storage engine instead and storing everything in the same hash table.
> > >
> > > Those three options completely avoid latency problems, the first one requires no code modifications and will work very well. The third is the most work (and will be tricky due to things like slab rebalance, and none of the slab class identification code will work). I would avoid it unless I were really bored and wanted to maintain my own fork forever.
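
To make the first of those options concrete, here's a rough client-side sketch using libmemcached. The hosts, ports and the 300-second cutoff are placeholders; the real split should follow whatever TTL ranges your traffic actually uses:

/* gcc -o ttl_pools ttl_pools.c -lmemcached */
#include <string.h>
#include <libmemcached/memcached.h>

/* Two separate pools: one instance only ever sees short-TTL items, the
 * other only long-TTL items, so short-lived junk can never pin memory
 * that long-lived items want (or vice versa). */
static memcached_st *short_pool, *long_pool;

static void pools_init(void) {
    short_pool = memcached_create(NULL);
    long_pool  = memcached_create(NULL);
    memcached_server_add(short_pool, "127.0.0.1", 11211);  /* short-TTL box */
    memcached_server_add(long_pool,  "127.0.0.1", 11212);  /* long-TTL box  */
}

/* Route a write to the pool matching its TTL bucket.  The 300s cutoff is
 * an arbitrary placeholder. */
static memcached_return_t set_by_ttl(const char *key, const char *val,
                                     time_t ttl) {
    memcached_st *pool = (ttl <= 300) ? short_pool : long_pool;
    return memcached_set(pool, key, strlen(key), val, strlen(val),
                         ttl, (uint32_t)0);
}

int main(void) {
    pools_init();
    set_by_ttl("stats-slice-123", "42", 120);      /* goes to the short pool */
    set_by_ttl("ad-priority-foo", "{}", 21600);    /* goes to the long pool  */
    memcached_free(short_pool);
    memcached_free(long_pool);
    return 0;
}

Reads need a matching rule; in practice you route by key namespace (e.g. all the short-lived stats keys live on the short pool), since the TTL isn't known at get time. It's purely client configuration, nothing changes on the servers.
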
> > > > ~2 years ago i created another version based on that 1.4.13, that does garbage collection using a custom stats handler. That version is able to run on half of the memory for like 2 weeks, with 0 evictions. But we gave it the full 16G and just restart it each week to be sure memory usage is kept in check, and we're not throwing away good data. Actually after changing -f1.25 to -f1.041 the slabs are filling with bad items much slower, because items are distributed better and this custom eviction function is able to catch more expired data. We have like 200GB of data evicted this way, daily. Because of volume (~40k req/s peak, much of it are writes) and differences in expire time LRU isn't able to reclaim items efficiently.
> > > >
> > > > Maybe people don't even realize the problem, but when we did some testing and turned off that "custom" eviction we had like 100% memory used with 10% of waste reported by memcached admin. But when we ran that custom eviction algorithm it turned out that 90% of memory is occupied by garbage. Waste reported grew to 80% instantly after running unlimited "reclaim expired" on all items in the cache. So in "standard" client when people will be using different expire times for items (we have it like 1 minute minimum, 6h max)... they even won't be able to see how much memory they're wasting in some specific cases, when they'll have many items that won't be hit after expiration, like we have.
> > > >
> > > > When using memcached as a buffer for mysql writes, we know exactly what to hit and when. Short TTL expired items pile up near the head... long TTL "live" items pile up near the tail and it's creating a barrier that prevents the LRU algo from reclaiming almost anything, if im getting how it currently works correctly...
> > > >
> > > > > You made it sound like you had some data which never expired? Is this true?
> > > >
> > > > Yes, i think because of how evictions are made (to be clear we're not setting non-expiring items). These short expiring items pile up in the front of linked list, something that is supposed to live for eg. 120 or 180 seconds is lingering in memory forever, until we restart the cache... and new items are killed almost instantly because there are no expired items on head.
> > > >
> > > > It's a special case, because after processing memory list, aggregating data and putting it in mysql these items are no longer touched. The list for new time period will have completely different set of keys. As we use a prefix to generate all items in the list:
> > > >
> > > > $time_slice = floor( self::$time / 60) - $time_slices_back;
> > > > $prefix = ")ML){$list_id}-{$time_slice}";
> > > >
> > > > Again, not saying current implementation is bad... because it's fast and doesn't trash CPU cache when expire times are ~equal, that was probably the idea... but we have an atypical use case, which LRU isn't able to manage...
> > > >
> > > > Now im making ~same changes i made for .13... but for .17 and i want to make it work a little better ;)
> > > >
> > > > On Friday, 11 April 2014 05:12:10 UTC+2, Dormando wrote:
> > > >
> > > > > > Hey Dormando, thanks again for some comments... appreciate the help.
> > > > > >
> > > > > > Maybe i wasn't clear enough.
> > > > > > I need only 1 minute persistence, and i can lose data sometimes, just i can't keep losing data every minute due to constant evictions caused by LRU. Actually i have just written that in my previous post. We're losing about 1 minute of non-meaningful data every week because of the restart that we do when memory starts to fill up (even with our patch reclaiming using linked list, we limit reclaiming to keep speed better)... so the memory fills up after a week, not 30 minutes...
> > > > >
> > > > > Can you explain what you're seeing in more detail? Your data only needs to persist for 1 minute, but it's being evicted before 1 minute is up?
> > > > >
> > > > > You made it sound like you had some data which never expired? Is this true?
> > > > >
> > > > > If your instance is 16GB, takes a week to fill up, but data only needs to persist for a minute but isn't, something else is very broken? Or am I still misunderstanding you?
> > > > >
> > > > > > Now im creating a better solution, to limit locking as the linked list is getting bigger.
> > > > > >
> > > > > > I explained what the worst implications of unwanted evictions (or losing all data in cache) are in my use case:
> > > > > > 1. losing ~1 minute of non-significant data that's about to be stored in sql
> > > > > > 2. "flat" distribution of load to workers (not taking response times into account because stats reset).
> > > > > > 3. resorting to alternative targeting algorithm (with global, not local statistics).
> > > > > >
> > > > > > I never, ever said im going to write data that have to be persistent permanently. It's actually the same idea as delayed write. If power fails you lose 5s of data, but you can do 100x more writes. So you need the data to be persistent in memory, between writes the data **can't be lost**. However you can lose it sometimes, that's the tradeoff that some people can make and some not. Obviously I can't keep losing this data each minute, because if i lose much it'll become meaningful.
> > > > > >
> > > > > > Maybe i wasn't clear in that matter. I can lose all data even 20 times a day. Sensitive data is stored using bulk update or transactions, bypassing that "delayed write" layer. "0 evictions", that's the kind of "persistence" im going for. So items are persistent for some very short periods of time (1-5 minutes) without being killed. It's just a different use case. Running in production since 2 years, based on 1.4.13, tested for correctness, monitored so we have enough memory and 0 evictions (just reclaims).
> > > > > >
> > > > > > When i came here with the same idea ~2 years ago you just said it's very stupid, now you even made me look like a moron :) And i can understand why you don't want features that are not ~O(1) perfectly, but please don't get so personal about different ideas to do things and use cases, just because these won't work for you.
> > > > > >
> > > > > > On Thursday, 10 April 2014 20:53:12 UTC+2, Dormando wrote:
> > > > > >
> > > > > > > You really really really really really *must* not put data in memcached which you can't lose.
> > > > > > >
> > > > > > > Seriously, really don't do it. If you need persistence, try using a redis instance for the persistent stuff, and use memcached for your cache stuff.
> > > > > > > I don't see why you feel like you need to write your own thing, there're a lot of persistent key/value stores (kyotocabinet/etc?). They have a much lower request ceiling and don't handle the LRU/cache pattern as well, but that's why you can use both.
> > > > > > >
> > > > > > > Again, please please don't do it. You are damaging your company. You are a *danger* to your company.
> > > > > > >
> > > > > > > On Thu, 10 Apr 2014, Slawomir Pryczek wrote:
> > > > > > >
> > > > > > > > Hi Dormando, thanks for suggestions, a background thread would be nice...
> > > > > > > >
> > > > > > > > The idea is actually that with 2-3GB i get plenty of evictions of items that need to be fetched later. And with 16GB i still get evictions, actually probably i could throw more memory than 16G at it and it'd only result in more expired items sitting in the middle of slabs, forever... Now im going for persistence. Sounds probably crazy, but we're having some data that we can't lose:
> > > > > > > >
> > > > > > > > 1. statistics, we aggregate writes to DB using memcached (+list implementation). If these items get evicted we're losing rows in db. Losing data sometimes isn't a big problem. Eg. we restart memcached once a week so we're losing 1 minute of data every week. But if we have evictions we're losing data constantly (which we can't have)
> > > > > > > > 2. we drive load balancer using data in memcached for statistics, again, not nice to lose data often because workers can get incorrect amount of traffic.
> > > > > > > > 3. we're doing some adserving optimizations, eg. counting per-domain ad priority, for one domain it takes about 10 seconds to analyze all data and create list of ads, so can't be done online... we put result of this in memcached, if we lose too much of this the system will start to serve suboptimal ads (because it'll need to switch to more general data or much simpler algorithm that can be done instantly)
> > > > > > > >
> > > > > > > > Probably would be best to rewrite all this using C or golang, and use memcached just for caching, but it'd take too much time which we don't have currently...
> > > > > > > >
> > > > > > > > I have seen twitter and nk implementations that seem to do what i need, but they seem old (based on old code), so I prefer to modify code of recent "official" memcached, to not be stuck with old code or abandonware. Actually there are many topics about limitations of the current eviction algo, and an option to enable some background thread to do scraping based on statistics of most filled slabs (with some parameter to specify if it should take a light or aggressive approach) would be nice...
> > > > > > > >
> > > > > > > > As for the code... is that the slab_rebalance_move function in slab.c? It seems a little difficult to grasp without some DOCs of how things are working... can you please write a very short description of how this "angry birds" mode works?
> > > > > > >
> > > > > > > Look at doc/protocol.txt for explanations of the slab move options. The names are greppable back to the source.
> > > > > > >
> > > > > > > > I have a quick question about this above... linked is item that's placed on linked list, but what do the other flags mean, and why are the last 2 of them temporary?
> > > > > > > > #define ITEM_LINKED 1
> > > > > > > > #define ITEM_CAS 2
> > > > > > > >
> > > > > > > > /* temp */
> > > > > > > > #define ITEM_SLABBED 4
> > > > > > > > #define ITEM_FETCHED 8
> > > > > > > >
> > > > > > > > This from slab_rebalance_move seems interesting:
> > > > > > > >
> > > > > > > > refcount = refcount_incr(&it->refcount);
> > > > > > > > ...
> > > > > > > > if (refcount == 1) { /* item is unlinked, unused */
> > > > > > > > ...
> > > > > > > > } else if (refcount == 2) { /* item is linked but not busy */
> > > > > > > >
> > > > > > > > Are there some docs about refcounts, locks and item states? Basically, why is an item with refcount 2 not busy? You're increasing refcount by 1 on select, then again when reading data? Can refcount ever be higher than 2 (3 in above case), meaning 2 threads can access the same item?
> > > > > > >
> > > > > > > The comment on the same line is explaining exactly what it means.
> > > > > > >
> > > > > > > Unfortunately it's a bit of a crap shoot. I think I wrote a threads explanation somewhere (some release notes, or in a file in there, I can't quite remember offhand). Since scaling the thread code it got a lot more complicated. You have to be extremely careful under what circumstances you access items (you must hold an item lock + the refcount must be 2 if you want to unlink it).
> > > > > > >
> > > > > > > You'll just have to study it a bit, sorry. Grep around to see where the flags are used.
> > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > On Thursday, 10 April 2014 06:05:30 UTC+2, Dormando wrote:
> > > > > > > >
> > > > > > > > > > Hi Guys,
> > > > > > > > > > im running a specific case where i don't want (actually can't have) to have evicted items (evictions = 0 ideally)... now i have created some simple algo that locks the cache, goes through the linked list and evicts items... it makes some problems, like 10-20ms cache locks in some cases.
> > > > > > > > > >
> > > > > > > > > > Now im thinking about going through each slab memory (slabs keep a list of allocated memory regions)... looking for items, if an expired item is found, evict it... this way i can go eg. 10k items or 1MB of memory at a time + pick slabs with high utilization and run this "additional" eviction only on them... so it'll prevent allocating memory just because unneeded data with short TTL is occupying the HEAD of the list.
> > > > > > > > > >
> > > > > > > > > > With this linked list eviction im able to run on 2-3GB of memory... without it 16GB of memory is exhausted in 1-2h and then memcached starts to kill "good" items (leaving expired ones wasting memory)...
> > > > > > > > > >
> > > > > > > > > > Any comments?
> > > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > you're going a bit against the base algorithm. if stuff is falling out of 16GB of memory without ever being utilized again, why is that critical? Sounds like you're optimizing the numbers instead of actually tuning anything useful.
> > > > > > > > >
> > > > > > > > > That said, you can probably just extend the slab rebalance code. There's a hook in there (which I called "Angry birds mode") that drives a slab rebalance when it'd otherwise run an eviction. That code already safely walks the slab page for unlocked memory and frees it; you could edit it slightly to check for expiration and then freelist it into the slab class instead.
> > > > > > > > >
> > > > > > > > > Since it's already a background thread you could further modify it to just wake up and walk pages for stuff to evict.
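
To show roughly what "check for expiration and then freelist it into the slab class" could look like, here's a tiny self-contained toy model. The struct names, fields and chunk count are invented; memcached's real page walk, locking and freelists (slabs.c / items.c) are considerably more involved:

/* gcc -o page_walk_sketch page_walk_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHUNK_COUNT 8   /* chunks per toy "page" */

/* Toy model of one slab page carved into fixed-size chunks.  Field names
 * and sizes are invented; memcached's real item header, locking and
 * freelist live in items.c / slabs.c. */
typedef struct toy_chunk {
    int    occupied;             /* 0 = already on the free list */
    int    refcount;             /* 1 = linked, nobody else holding it */
    time_t exptime;              /* absolute expiry, 0 = never */
    struct toy_chunk *free_next; /* freelist link when not occupied */
} toy_chunk;

typedef struct {
    toy_chunk chunks[CHUNK_COUNT];   /* "the page" */
    toy_chunk *freelist;             /* per-class free list */
} toy_slab_class;

/* The idea from the thread: while walking a page chunk by chunk (as the
 * slab-rebalance path already does for unlocked memory), also test
 * exptime and hand expired chunks straight back to the class freelist
 * instead of waiting for them to reach the LRU tail. */
static int reclaim_expired_in_page(toy_slab_class *cls, time_t now) {
    int reclaimed = 0;
    for (int i = 0; i < CHUNK_COUNT; i++) {
        toy_chunk *c = &cls->chunks[i];
        if (c->occupied && c->refcount == 1 &&
            c->exptime != 0 && c->exptime <= now) {
            c->occupied  = 0;            /* "unlink" the dead item */
            c->free_next = cls->freelist;
            cls->freelist = c;           /* freelist it into the class */
            reclaimed++;
        }
    }
    return reclaimed;
}

int main(void) {
    toy_slab_class cls = {0};
    time_t now = time(NULL);
    for (int i = 0; i < CHUNK_COUNT; i++) {
        cls.chunks[i].occupied = 1;
        cls.chunks[i].refcount = 1;
        /* odd chunks expired a minute ago, even chunks live for an hour */
        cls.chunks[i].exptime = (i % 2) ? now - 60 : now + 3600;
    }
    printf("reclaimed %d expired chunks\n", reclaim_expired_in_page(&cls, now));
    return 0;
}

Doing this during a page walk, rather than only from the LRU tail, is what lets expired short-TTL items go back on the freelist even while live long-TTL items sit at the tail in front of them.
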
--

---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
