Re: Check for orphaned items in lru crawler thread

Scott Mansfield Mon, 03 Aug 2015 00:54:10 -0700

The command line I've used that will start is:

memcached -m 64 -o slab_reassign,slab_automove



The ones that fail are:


memcached -m 64 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer

memcached -o lru_crawler


I'm sure I've missed something during compile, though I just used 
./configure and make.

On Monday, August 3, 2015 at 12:22:33 AM UTC-7, Scott Mansfield wrote:
>
> I've attached a pretty simple program to connect, fill a slab with data, 
> and then fill another slab slowly with data of a different size. I've been 
> trying to get memcached to run with the lru_crawler and lru_maintainer 
> flags, but I get 'Illegal suboption "(null)"' every time I try to start 
> with either in any configuration.
>
>
> I haven't seen it start to move slabs automatically with a freshly 
> installed 1.4.24.
>
> On Tuesday, July 21, 2015 at 4:55:17 PM UTC-7, Scott Mansfield wrote:
>>
>> I realize I've not given you the tests to reproduce the behavior. I 
>> should be able to soon. Sorry about the delay here.
>>
>> In the meantime, I wanted to bring up a possible secondary use of the 
>> logic that moves items during slab rebalancing. I think the system might 
>> benefit from using the same logic to crawl the pages in a slab and compact 
>> the data in the background. In the case where we have memory that is 
>> assigned to the slab but not being used because of replaced or TTL'd out 
>> data, returning the memory to a pool of free memory will allow a slab to 
>> grow with that memory first instead of waiting for an event where memory is 
>> needed at that instant.
>>
>> It's a change in approach, from reactive to proactive. What do you think?
>>
>> On Monday, July 13, 2015 at 5:54:11 PM UTC-7, Dormando wrote:
>>>
>>> > First, more detail for you: 
>>> > 
>>> > We are running 1.4.24 in production and haven't noticed any bugs as of
>>> > yet. The new LRUs seem to be working well, though we nearly always run
>>> > memcached scaled to hold all data without evictions. Those with
>>> > evictions are behaving well. Those without evictions haven't seen
>>> > crashing or any other noticeable bad behavior.
>>>
>>> Neat. 
>>>
>>> > 
>>> > OK, I think I see an area where I was speculating on functionality. If
>>> > you have a key in slab 21 and then the same key is written again at a
>>> > larger size in slab 23 I assumed that the space in 21 was not freed on
>>> > the second write. With that assumption, the LRU crawler would not free
>>> > up that space. Also just by observation in the macro, the space is not
>>> > freed fast enough to be effective, in our use case, to accept the writes
>>> > that are happening. Think in the hundreds of millions of "overwrites" in
>>> > a 6 - 10 hour period across a cluster.
>>>
>>> Internally, "items" (a key/value pair) are generally immutable. The only
>>> time when it's not is for INCR/DECR, and it still becomes immutable if
>>> two INCR/DECR's collide.
>>>
>>> What this means is that the new item is staged in a piece of free
>>> memory while the "upload" stage of the SET happens. When memcached has
>>> all of the data in memory to replace the item, it does an internal swap
>>> under a lock. The old item is removed from the hash table and LRU, and
>>> the new item gets put in its place (at the head of the LRU).
>>>
>>> Since items are refcounted, this means that if other users are
>>> downloading an item which just got replaced, their memory doesn't get
>>> corrupted by the item changing out from underneath them. They can
>>> continue to read the old item until they're done. When the refcount
>>> reaches zero the old memory is reclaimed.
>>>
>>> Most of the time, the item replacement happens and the old memory is
>>> then immediately reclaimed.
>>>
>>> However, this does mean that you need *one* piece of free memory to 
>>> replace the old one. Then the old memory gets freed after that set. 
>>>
>>> So if you take a memcached instance with 0 free chunks, and do a rolling
>>> replacement of all items (within the same slab class as before), the
>>> first one would cause an eviction from the tail of the LRU to get a free
>>> chunk. Every SET after that would use the chunk freed from the
>>> replacement of the previous memory.
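
The rolling replacement described above can be modelled with a small toy simulation. This is only a sketch under stated assumptions, not memcached's code: the `SlabClass` class, its `set` method and the chunk counts are hypothetical, and refcounts are ignored, so an old chunk is freed as soon as the swap completes.

```python
from collections import OrderedDict


class SlabClass:
    """Toy model of one slab class with a fixed number of chunks (hypothetical)."""

    def __init__(self, total_chunks):
        self.free_chunks = total_chunks
        self.lru = OrderedDict()  # first entry is the LRU tail, last is the head
        self.evictions = 0

    def set(self, key, value):
        # Stage the new item in a free chunk; evict the LRU tail if none is free.
        if self.free_chunks == 0:
            self.lru.popitem(last=False)
            self.evictions += 1
            self.free_chunks += 1
        self.free_chunks -= 1
        # Swap: unlink the old item, if it is still stored, and free its chunk.
        if key in self.lru:
            del self.lru[key]
            self.free_chunks += 1
        self.lru[key] = value  # the new item goes to the head of the LRU


def rolling_replacement(num_items=100):
    """Fill a class completely, then overwrite every key once."""
    slab = SlabClass(num_items)
    for i in range(num_items):
        slab.set(f"k{i}", "old")
    # Overwrite newest-first so the evicted tail is not the key being replaced.
    for i in reversed(range(num_items)):
        slab.set(f"k{i}", "new")
    return slab


if __name__ == "__main__":
    slab = rolling_replacement()
    print("evictions:", slab.evictions, "items:", len(slab.lru))
```

In this model only the first overwrite has to evict; every later SET reuses the chunk released by the previous swap. The overwrite order matters in the toy: if the key being replaced is itself the LRU tail, the eviction removes that key and no chunk is left over for the next SET.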
>>>
>>> > After that last sentence I realized I also may not have explained well
>>> > enough the access pattern. The keys are all overwritten every day, but
>>> > it takes some time to write them all (obviously). We see a huge increase
>>> > in the bytes metric as if the new data for the old keys was being
>>> > written for the first time. Since the "old" slab for the same key
>>> > doesn't proactively release memory, it starts to fill up the cache and
>>> > then starts evicting data in the new slab. Once that happens, we see
>>> > evictions in the old slab because of the algorithm you mentioned (random
>>> > picking / freeing of memory). Typically we don't see any use for
>>> > "upgrading" an item as the new data would be entirely new and should
>>> > wholesale replace the old data for that key. More specifically, the
>>> > operation is always set, with different data each day.
>>>
>>> Right. Most of your problems will come from two areas. One is that when
>>> you write data aggressively into the new slab class (unless you set the
>>> rebalancer to always-replace mode), the mover makes memory available
>>> more slowly than you can insert. So you'll cause extra evictions in the
>>> new slab class.
>>>
>>> The secondary problem is from the random evictions in the previous slab 
>>> class as stuff is chucked on the floor to make memory moveable. 
>>>
>>> > As for testing, we'll be able to put it under real production
>>> > workload. I don't know what kind of data you mean you need for testing.
>>> > The data stored in the caches are highly confidential. I can give you
>>> > all kinds of metrics, since we collect most of the ones that are in the
>>> > stats and some from the stats slabs output. If you have some specific
>>> > ones that need collecting, I'll double check and make sure we can get
>>> > those. Alternatively, it might be most beneficial to see the metrics in
>>> > person :)
>>>
>>> I just need stats snapshots here and there, and actually putting the
>>> thing under load. When I did the LRU work I had to beg for several months
>>> before anyone tested it with a production load. This slows things down
>>> and demotivates me from working on the project.
>>>
>>> Unfortunately my dayjob keeps me pretty busy so ~internet~ would
>>> probably be best.
>>>
>>> > I can create a driver program to reproduce the behavior on a smaller
>>> > scale. It would write e.g. 10k keys of 10k size, then rewrite the same
>>> > keys with different size data. I'll work on that and post it to this
>>> > thread when I can reproduce the behavior locally.
>>>
>>> Ok. There're slab rebalance unit tests in the t/ directory which do
>>> things like this, and I've used mc-crusher to slam the rebalancer. It's
>>> pretty easy to run one config to load up 10k objects, then flip to the
>>> other using the same key namespace.
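
A driver of the kind discussed above (load N keys at one size, then rewrite the same key namespace at another size) could look roughly like the sketch below. It is neither the program attached to this thread nor mc-crusher. The helper names, the `key:` prefix, the host and port, and the two value sizes are all assumptions for illustration. The request encoding follows the memcached text protocol's `set <key> <flags> <exptime> <bytes>` form.

```python
import socket


def build_set(key: str, value: bytes, exptime: int = 0, flags: int = 0) -> bytes:
    """Encode one memcached text-protocol "set" request."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def phase(num_keys: int, value_size: int, prefix: str = "key"):
    """Yield one encoded set per key; every phase reuses the same key names."""
    payload = b"x" * value_size
    for i in range(num_keys):
        yield build_set(f"{prefix}:{i}", payload)


def run(host="127.0.0.1", port=11211, num_keys=10_000, sizes=(10_000, 15_000)):
    """Write all keys at sizes[0], then overwrite them at each later size.

    The host, port and sizes are placeholders; point this only at a
    disposable test instance.
    """
    with socket.create_connection((host, port)) as sock:
        reader = sock.makefile("rb")
        for size in sizes:
            for request in phase(num_keys, size):
                sock.sendall(request)
                reply = reader.readline()
                if reply != b"STORED\r\n":
                    raise RuntimeError(f"unexpected reply: {reply!r}")
```

Taking `stats` and `stats slabs` snapshots between the two phases would then show how memory shifts between the slab classes involved.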
>>>
>>> > Thanks, 
>>> > Scott 
>>> > 
>>> > On Saturday, July 11, 2015 at 12:05:54 PM UTC-7, Dormando wrote: 
>>> >       Hey, 
>>> > 
>>> >       On Fri, 10 Jul 2015, Scott Mansfield wrote: 
>>> > 
>>> >       > We've seen issues recently where we run a cluster that typically has
>>> >       > the majority of items overwritten in the same slab every day and a
>>> >       > sudden change in data size evicts a ton of data, affecting downstream
>>> >       > systems. To be clear that is our problem, but I think there's a tweak
>>> >       > in memcached that might be useful and another possible feature that
>>> >       > would be even better.
>>> >       > The data that is written to this cache is overwritten every day,
>>> >       > though the TTL is 7 days. One slab takes up the majority of the space
>>> >       > in the cache. The application wrote e.g. 10KB (slab 21) every day for
>>> >       > each key consistently. One day, a change occurred where it started
>>> >       > writing 15KB (slab 23), causing a migration of data from one slab to
>>> >       > another. We had -o slab_reassign,slab_automove=1 set on the server,
>>> >       > causing large numbers of evictions on the initial slab. Let's say the
>>> >       > cache could hold the data at 15KB per key, but the old data was not
>>> >       > technically TTL'd out in its old slab. This means that memory was not
>>> >       > being freed by the lru crawler thread (I think) because its expiry had
>>> >       > not come around.
>>> >       > 
>>> >       > lines 1199 and 1200 in items.c:
>>> >       > if ((search->exptime != 0 && search->exptime < current_time)
>>> >       >     || is_flushed(search)) {
>>> >       > 
>>> >       > If there was a check to see if this data was "orphaned," i.e. that the
>>> >       > key, if accessed, would map to a different slab than the current one,
>>> >       > then these orphans could be reclaimed as free memory. I am working on a
>>> >       > patch to do this, though I have reservations about performing a hash
>>> >       > on the key on the lru crawler thread (if the hash is not already
>>> >       > available).
>>> >       > I have very little experience in the memcached codebase so I don't
>>> >       > know the most efficient way to do this. Any help would be appreciated.
>>> > 
>>> >       There seems to be a misconception about how the slab classes work. A
>>> >       key, if already existing in a slab, will always map to the slab class
>>> >       it currently fits into. The slab classes always exist, but the amount
>>> >       of memory reserved for each of them will shift with the slab_reassign.
>>> >       ie: 10 pages in slab class 21, then memory pressure on 23 causes it to
>>> >       move over.
>>> > 
>>> >       So if you examine a key that still exists in slab class 21, it has no
>>> >       reason to move up or down the slab classes.
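
The point above, that an item's slab class follows from its size rather than from its key, can be illustrated with a short sketch. The base chunk size, growth factor, alignment and item limit below are parameters chosen for illustration. A real server's class boundaries depend on its version and startup options, so the class numbers printed here need not match the slab 21 and slab 23 mentioned in the thread.

```python
def slab_class_sizes(base=96, factor=1.25, item_max=1024 * 1024, align=8):
    """Build a list of chunk sizes that grows geometrically (illustrative values)."""
    sizes = []
    size = base
    while size <= item_max / factor:
        if size % align:
            size += align - (size % align)  # round up to the alignment
        sizes.append(size)
        size = int(size * factor)
    sizes.append(item_max)  # the last class holds the largest allowed item
    return sizes


def slab_class_for(item_size, sizes):
    """Return the 1-based index of the smallest class whose chunk fits the item."""
    for index, chunk in enumerate(sizes, start=1):
        if item_size <= chunk:
            return index
    return None  # larger than the biggest chunk: cannot be stored


if __name__ == "__main__":
    sizes = slab_class_sizes()
    for total in (10_000, 15_000):
        print(total, "bytes -> class", slab_class_for(total, sizes))
```

Writing the same key with a larger value simply stores a new item in whichever class fits the new size. The old item is unlinked during the swap, so nothing is left behind in the smaller class to be "orphaned".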
>>> > 
>>> >       > Alternatively, and possibly more beneficial is compaction of data in a
>>> >       > slab using the same set of criteria as lru crawling. Understandably,
>>> >       > compaction is a very difficult problem to solve since moving the data
>>> >       > would be a pain in the ass. I saw a couple of discussions about this
>>> >       > in the mailing list, though I didn't see any firm thoughts about it. I
>>> >       > think it can probably be done in O(1) like the lru crawler by limiting
>>> >       > the number of items it touches each time. Writing and reading are
>>> >       > doable in O(1) so moving should be as well. Has anyone given more
>>> >       > thought on compaction?
>>> > 
>>> >       I'd be interested in hacking this up for you folks if you can provide
>>> >       me testing and some data to work with. With all of the LRU work I did
>>> >       in 1.4.24, the next thing I wanted to do is a big improvement on the
>>> >       slab reassignment code.
>>> > 
>>> >       Currently it picks essentially a random slab page, empties it, and
>>> >       moves the slab page into the class under pressure.
>>> > 
>>> >       One thing we can do is first examine for free memory in the existing
>>> >       slab, IE:
>>> > 
>>> >       - Take a page from slab 21
>>> >       - Scan the page for valid items which need to be moved
>>> >       - Pull free memory from slab 21, migrate the item (moderately
>>> >         complicated)
>>> >       - When the page is empty, move it (or give up if you run out of free
>>> >         chunks).
>>> > 
>>> >       The next step is to pull from the LRU on slab 21: 
>>> > 
>>> >       - Take page from slab 21 
>>> >       - Scan page for valid items 
>>> >       - Pull free memory from slab 21, migrate the item 
>>> >         - If no memory free, evict tail of slab 21. use that chunk. 
>>> >       - When the page is empty, move it. 
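
The two variants listed above can be sketched as one function with an eviction switch. This is a toy model of the proposal, not the slab mover's actual code: `empty_page` and its arguments are hypothetical names, the page is reduced to a list of keys, and migrating an item is assumed not to change its position in the LRU.

```python
from collections import deque


def empty_page(page_keys, free_chunks, lru_order, allow_evict):
    """Try to empty one page of a slab class so it can be reassigned.

    page_keys:   keys of valid items stored in the page being moved
    free_chunks: free chunks available in the class's other pages
    lru_order:   every key in the class, least recently used first
    allow_evict: False models the first variant, True the second
    Returns (moved, evicted, page_is_empty).
    """
    remaining = list(page_keys)  # items still sitting in the page
    lru = deque(lru_order)
    moved, evicted = [], []
    while remaining:
        if free_chunks > 0:
            # Migrate one item into a free chunk elsewhere in the class.
            free_chunks -= 1
            moved.append(remaining.pop(0))
            continue
        if not allow_evict or not lru:
            return moved, evicted, False  # out of free chunks: give up
        victim = lru.popleft()  # evict the tail of the class LRU
        evicted.append(victim)
        if victim in remaining:
            remaining.remove(victim)  # it lived in this page: nothing to move
        else:
            free_chunks += 1  # its chunk in another page becomes usable
    return moved, evicted, True


if __name__ == "__main__":
    lru = ["x", "y", "a", "b"]  # "x" is the least recently used key
    print(empty_page(["a", "b"], 1, lru, allow_evict=False))
    print(empty_page(["a", "b"], 0, lru, allow_evict=True))
```

With eviction disabled the move is abandoned as soon as the class runs out of free chunks. With eviction enabled, only items at the tail of the LRU are dropped, rather than whatever happens to sit in the chosen page.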
>>> > 
>>> >       Then, when you hit this condition your least-recently-used data gets
>>> >       culled as new data migrates your page class. This should match a
>>> >       natural occurrence if you would already be evicting valid (but old)
>>> >       items to make room for new items.
>>> > 
>>> >       A bonus to using the free memory trick is that I can use the amount of
>>> >       free space in a slab class as a heuristic to more quickly move slab
>>> >       pages around.
>>> > 
>>> >       If it's still necessary from there, we can explore "upgrading" items to
>>> >       a new slab class, but that is much much more complicated since the item
>>> >       has to shift LRU's. Do you put it at the head, the tail, the middle,
>>> >       etc? It might be impossible to make a good generic decision there.
>>> > 
>>> >       What version are you currently on? If 1.4.24, have you seen any
>>> >       instability? I'm currently torn between fighting a few bugs and
>>> >       starting on improving the slab rebalancer.
>>> > 
>>> >       -Dormando 
>>> > 
>>> > 
>>> > -- 
>>> > 
>>> > --- 
>>> > You received this message because you are subscribed to the Google
>>> > Groups "memcached" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send
>>> > an email to memcached+...@googlegroups.com.
>>> > For more options, visit https://groups.google.com/d/optout.
>>> > 
>>> >
>>
>>

