I realize I've not given you the tests to reproduce the behavior. I should 
be able to soon. Sorry about the delay here.
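
For now, here's the rough shape of the driver (a sketch using libmemcached, 
untested; the key prefix and sizes are just placeholders for our real 
workload):

/* Sketch of the reproduction driver: write 10k keys at ~10KB (slab 21),
 * then rewrite the same keys at ~15KB (slab 23) to force the migration. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NKEYS 10000

static void write_pass(memcached_st *memc, size_t value_size) {
    char *value = malloc(value_size);
    memset(value, 'x', value_size);
    for (int i = 0; i < NKEYS; i++) {
        char key[32];
        snprintf(key, sizeof(key), "repro:%d", i);
        memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                              value, value_size,
                                              (time_t)(7 * 86400), /* 7d TTL */
                                              (uint32_t)0);
        if (rc != MEMCACHED_SUCCESS)
            fprintf(stderr, "set %s: %s\n", key,
                    memcached_strerror(memc, rc));
    }
    free(value);
}

int main(void) {
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);
    write_pass(memc, 10 * 1024); /* "day 1": values land in slab 21 */
    write_pass(memc, 15 * 1024); /* "day 2": same keys, now slab 23 */
    memcached_free(memc);
    return 0;
}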

Separately, I wanted to bring up a possible secondary use of the same 
item-moving logic during slab rebalancing. I think the system might benefit 
from using that logic to crawl the pages in a slab and compact the data in 
the background. Where memory is assigned to a slab but sits unused because 
of replaced or TTL'd-out data, returning it to a pool of free memory would 
let any slab class grow from that pool first, instead of waiting for an 
event where memory is needed at that instant.
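
Concretely, I'm picturing something along these lines (pure sketch; none of 
these types or helpers exist in memcached, and the one-quarter threshold is 
just a placeholder):

/* Sketch of one background compaction pass over a slab class.
 * Illustrative only: these names are invented, not memcached internals. */
static void compact_class(slabclass_t *cls) {
    for (unsigned int p = 0; p < cls->page_count; p++) {
        page_t *page = cls->pages[p];
        unsigned int live = count_live_items(page); /* skips expired/replaced */

        /* Mostly-dead page: migrate the survivors into free chunks
         * elsewhere in the class, then hand the whole page back to the
         * global pool so any class can grow from it later. */
        if (live * 4 < page->chunk_count) {
            for (unsigned int c = 0; c < page->chunk_count; c++) {
                item *it = page_chunk(page, c);
                if (item_is_live(it))
                    migrate_item(cls, it); /* copy + hash/LRU relink, under lock */
            }
            return_page_to_global_pool(page);
        }
    }
}

Like the LRU crawler, it could bound the number of items touched per pass so 
the cost stays O(1) per invocation.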

It's a change in approach, from reactive to proactive. What do you think?

On Monday, July 13, 2015 at 5:54:11 PM UTC-7, Dormando wrote:
>
> > First, more detail for you: 
> > 
> > We are running 1.4.24 in production and haven't noticed any bugs as of 
> > yet. The new LRUs seem to be working well, though we nearly always run 
> > memcached scaled to hold all data without evictions. Those with evictions 
> > are behaving well. Those without evictions haven't seen crashing or any 
> > other noticeable bad behavior. 
>
> Neat. 
>
> > 
> > OK, I think I see an area where I was speculating on functionality. If 
> > you have a key in slab 21 and then the same key is written again at a 
> > larger size in slab 23, I assumed that the space in 21 was not freed on 
> > the second write. With that assumption, the LRU crawler would not free up 
> > that space. Also, just from observation at the macro level, the space is 
> > not freed fast enough, in our use case, to accept the writes that are 
> > happening. Think hundreds of millions of "overwrites" in a 6-10 hour 
> > period across a cluster. 
>
> Internally, "items" (a key/value pair) are generally immutable. The only 
> time when it's not is for INCR/DECR, and it still becomes immutable if two 
> INCR/DECR's collide. 
>
> What this means is that the new item is staged in a piece of free memory 
> while the "upload" stage of the SET happens. When memcached has all of the 
> data in memory to replace the item, it does an internal swap under a lock: 
> the old item is removed from the hash table and LRU, and the new item is 
> put in its place (at the head of the LRU). 
>
> Since items are refcounted, this means that if other users are downloading 
> an item which just got replaced, their memory doesn't get corrupted by the 
> item changing out from underneath them. They can continue to read the old 
> item until they're done. When the refcount reaches zero the old memory is 
> reclaimed. 
>
> Most of the time, the item replacement happens and then the old memory is 
> immediately freed. 
>
> However, this does mean that you need *one* piece of free memory to 
> replace the old one. Then the old memory gets freed after that set. 
>
> So if you take a memcached instance with 0 free chunks and do a rolling 
> replacement of all items (within the same slab class as before), the first 
> set would cause an eviction from the tail of the LRU to get a free chunk. 
> Every SET after that would use the chunk freed by the previous 
> replacement. 
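>
> In code, the swap looks schematically like this (names are illustrative, 
> not the actual memcached source): 
>
> /* Sketch of the replace-then-free-by-refcount scheme described above. */
> void replace_item_sketch(item *old_it, item *new_it) {
>     lock_bucket(old_it);
>     hash_unlink(old_it);       /* old item is no longer findable */
>     lru_unlink(old_it);
>     hash_link(new_it);
>     lru_link_head(new_it);     /* new item goes to the head of the LRU */
>     unlock_bucket(old_it);
>
>     /* Concurrent readers still hold references to the old copy; its
>      * chunk is reclaimed only when the last reference drops. */
>     if (refcount_decr(old_it) == 0)
>         free_chunk(old_it);
> }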
>
> > After that last sentence I realized I also may not have explained the 
> > access pattern well enough. The keys are all overwritten every day, but 
> > it takes some time to write them all (obviously). We see a huge increase 
> > in the bytes metric, as if the new data for the old keys were being 
> > written for the first time. Since the "old" slab for the same key doesn't 
> > proactively release memory, the cache starts to fill up and data then 
> > starts being evicted in the new slab. Once that happens, we see evictions 
> > in the old slab because of the algorithm you mentioned (random picking / 
> > freeing of memory). Typically we don't see any use for "upgrading" an 
> > item, as the new data would be entirely new and should wholesale replace 
> > the old data for that key. More specifically, the operation is always a 
> > set, with different data each day. 
>
> Right. Most of your problems will come from two areas. One is that when 
> you write data aggressively into the new slab class (unless you set the 
> rebalancer to always-replace mode), the mover will make memory available 
> more slowly than you can insert, so you'll cause extra evictions in the 
> new slab class. 
>
> The secondary problem comes from the random evictions in the previous slab 
> class as stuff is chucked on the floor to make memory moveable. 
>
> > As for testing, we'll be able to put it under real production workload. 
> > I don't know what kind of data you mean you need for testing. The data 
> > stored in the caches are highly confidential. I can give you all kinds of 
> > metrics, since we collect most of the ones that are in the stats and some 
> > from the stats slabs output. If you have some specific ones that 
> > need collecting, I'll double check and make sure we can get those. 
> > Alternatively, it might be most beneficial to see the metrics in person :) 
>
> I just need stats snapshots here and there, and to actually put the thing 
> under load. When I did the LRU work I had to beg for several months 
> before anyone tested it with a production load. That slows things down and 
> demotivates me from working on the project. 
>
> Unfortunately my dayjob keeps me pretty busy so ~internet~ would probably 
> be best. 
>
> > I can create a driver program to reproduce the behavior on a smaller 
> > scale. It would write e.g. 10k keys of 10k size, then rewrite the same 
> > keys with different size data. I'll work on that and post it to this 
> > thread when I can reproduce the behavior locally. 
>
> Ok. There are slab rebalance unit tests in the t/ directory which do things 
> like this, and I've used mc-crusher to slam the rebalancer. It's pretty 
> easy to run one config to load up 10k objects, then flip to the other 
> using the same key namespace. 
>
> > Thanks, 
> > Scott 
> > 
> > On Saturday, July 11, 2015 at 12:05:54 PM UTC-7, Dormando wrote: 
> >       Hey, 
> > 
> >       On Fri, 10 Jul 2015, Scott Mansfield wrote: 
> > 
> >       > We've seen issues recently where we run a cluster that typically 
> >       > has the majority of items overwritten in the same slab every day 
> >       > and a sudden change in data size evicts a ton of data, affecting 
> >       > downstream systems. To be clear that is our problem, but I think 
> >       > there's a tweak in memcached that might be useful and another 
> >       > possible feature that would be even better. 
> >       > The data that is written to this cache is overwritten every day, 
> >       > though the TTL is 7 days. One slab takes up the majority of the 
> >       > space in the cache. The application wrote e.g. 10KB (slab 21) 
> >       > every day for each key consistently. One day, a change occurred 
> >       > where it started writing 15KB (slab 23), causing a migration of 
> >       > data from one slab to another. We had -o 
> >       > slab_reassign,slab_automove=1 set on the server, causing large 
> >       > numbers of evictions on the initial slab. Let's say the cache 
> >       > could hold the data at 15KB per key, but the old data was not 
> >       > technically TTL'd out in its old slab. This means that memory was 
> >       > not being freed by the lru crawler thread (I think) because its 
> >       > expiry had not come around. 
> >       > 
> >       > lines 1199 and 1200 in items.c: 
> >       > if ((search->exptime != 0 && search->exptime < current_time) || 
> >       >     is_flushed(search)) { 
> >       > 
> >       > If there was a check to see if this data was "orphaned," i.e. 
> >       > that the key, if accessed, would map to a different slab than the 
> >       > current one, then these orphans could be reclaimed as free memory. 
> >       > I am working on a patch to do this, though I have reservations 
> >       > about performing a hash on the key on the lru crawler thread (if 
> >       > the hash is not already available). 
> >       > I have very little experience in the memcached codebase so I 
> >       > don't know the most efficient way to do this. Any help would be 
> >       > appreciated. 
> > 
> >       There seems to be a misconception about how the slab classes work. 
> >       A key, if already existing in a slab, will always map to the slab 
> >       class it currently fits into. The slab classes always exist, but the 
> >       amount of memory reserved for each of them will shift with 
> >       slab_reassign, i.e.: 10 pages in slab class 21, then memory pressure 
> >       on 23 causes it to move over. 
> > 
> >       So if you examine a key that still exists in slab class 21, it has 
> >       no reason to move up or down the slab classes. 
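> > 
> >       To illustrate: the class is chosen purely from the item's total 
> >       size at the time it's stored, roughly like this (slabs_clsid() is 
> >       the real lookup in slabs.c; the rest is shorthand): 
> > 
> >       /* ~10KB values fit class 21; ~15KB values fit class 23. */
> >       unsigned int clsid = slabs_clsid(ITEM_ntotal(it));
> >       /* An existing item is never re-hashed into another class; only a
> >        * new, larger write for the same key allocates from class 23. */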
> > 
> >       > Alternatively, and possibly more beneficial, is compaction of 
> >       > data in a slab using the same set of criteria as lru crawling. 
> >       > Understandably, compaction is a very difficult problem to solve 
> >       > since moving the data would be a pain in the ass. I saw a couple 
> >       > of discussions about this in the mailing list, though I didn't see 
> >       > any firm thoughts about it. I think it can probably be done in 
> >       > O(1) like the lru crawler by limiting the number of items it 
> >       > touches each time. Writing and reading are doable in O(1) so 
> >       > moving should be as well. Has anyone given more thought to 
> >       > compaction? 
> > 
> >       I'd be interested in hacking this up for you folks if you can 
> >       provide me testing and some data to work with. With all of the LRU 
> >       work I did in 1.4.24, the next thing I wanted to do is a big 
> >       improvement to the slab reassignment code. 
> > 
> >       Currently it picks essentially a random slab page, empties it, and 
> >       moves the slab page into the class under pressure. 
> > 
> >       One thing we can do is first examine for free memory in the 
> >       existing slab, IE: 
> > 
> >       - Take a page from slab 21 
> >       - Scan the page for valid items which need to be moved 
> >       - Pull free memory from slab 21, migrate the item (moderately 
> >         complicated) 
> >       - When the page is empty, move it (or give up if you run out of 
> >         free chunks). 
> > 
> >       The next step is to pull from the LRU on slab 21: 
> > 
> >       - Take page from slab 21 
> >       - Scan page for valid items 
> >       - Pull free memory from slab 21, migrate the item 
> >         - If no memory is free, evict the tail of slab 21 and use that chunk. 
> >       - When the page is empty, move it (both passes are sketched in code 
> >         below). 
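> > 
> >       In rough pseudo-C (illustrative names only, not the actual 
> >       rebalancer code): 
> > 
> >       /* Sketch of both passes: prefer free chunks, fall back to
> >        * evicting the LRU tail of the source class. */
> >       static int try_move_page(slabclass_t *src) {
> >           page_t *page = pick_page(src);
> >           for (unsigned int c = 0; c < page->chunk_count; c++) {
> >               item *it = page_chunk(page, c);
> >               if (!item_is_valid(it))
> >                   continue;                /* dead chunk, nothing to save */
> >               void *chunk = pop_free_chunk(src);
> >               if (chunk == NULL)
> >                   chunk = evict_lru_tail(src); /* second pass only */
> >               if (chunk == NULL)
> >                   return -1;               /* first pass: give up */
> >               migrate_item(it, chunk);     /* copy + relink hash/LRU */
> >           }
> >           move_page_to_class_under_pressure(page);
> >           return 0;
> >       }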
> > 
> >       Then, when you hit this condition, your least-recently-used data 
> >       gets culled as new data migrates your page class. This should match 
> >       a natural occurrence, if you would already be evicting valid (but 
> >       old) items to make room for new items. 
> > 
> >       A bonus to using the free-memory trick is that I can use the amount 
> >       of free space in a slab class as a heuristic to move slab pages 
> >       around more quickly. 
> > 
> >       If it's still necessary from there, we can explore "upgrading" 
> >       items to a new slab class, but that is much, much more complicated 
> >       since the item has to shift LRUs. Do you put it at the head, the 
> >       tail, the middle, etc.? It might be impossible to make a good 
> >       generic decision there. 
> > 
> >       What version are you currently on? If 1.4.24, have you seen any 
> >       instability? I'm currently torn between fighting a few bugs and 
> >       starting on improving the slab rebalancer. 
> > 
> >       -Dormando 
> > 
> > 
