> For the "running multiple copies"... im using persistent connection > but are you sure the amount of TCP communication will be good for > performance.
Have you tested it? You're making an awful lot of assumptions, and you seem
really itching to go modify some core code and deploy it. Why not *test* the
simplest ideas first, and only move on when you have to?

> I mean even locking the whole slab that has 1mb and scanning it? Will
> it take more than 1 ms on modern machine? Beside it's complicated to
> rewrite application like this.

If you're blocking the daemon at all, you're causing anything that would be
running in parallel to block for that 1ms. For really low request rates
that's fine, but we must support much more than that.

> @Dormando... why you call it "bizzare"? :) Rebalancing slabs shouldn't
> be much different.

Because it's a corner case, and your solution is to do a *ton* of work. So
much so that it walks into another corner case of its own: someone with a
192G cache holding 100 million entries that are all *valid* would end up
constantly locking/unlocking the cache while never invalidating anything.
Your tradeoff just moves the corner case somewhere else; my complaint is
that it's not sufficiently generic for us to ship.

> What you think about forking the app? (i mean forking the in-memory
> process). Should work well on modern kernel without locking because
> you have copy-on-write? Maybe locking then copying the whole single
> slab? I can allocate some buffer, which will be size of single slab
> then use LOCK, copy ONE slab into the buffer and use another thread to
> build a list of items we can remove. Copying eg. 1 mb of memory should
> happen in no time.

I had some test code which memcpy'd about 2k of memory 5 times per second
while holding a stats lock, and that cut the top throughput by at least 5%.
The real impact was worse than that, since the same test code had also
removed dozens of (uncontested) mutex lock calls and replaced them with that
tiny memcpy.
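For a sense of scale, here's a minimal standalone sketch of that kind of
measurement. This is not the actual test code; the thread count, copy size,
pacing, and file name are all made up. Worker threads model request threads
that briefly take a shared lock; a "scanner" thread periodically memcpy's a
buffer while holding the same lock. Run it once without the copy and once
with it, and compare the reported lock throughput.

    /* memcpy_lock_bench.c -- standalone sketch, not memcached code.
     * Workers hammer a shared mutex the way request threads would; a scanner
     * thread occasionally memcpy's a buffer while holding that same mutex.
     * Run once with no argument (baseline) and once with any argument (copy
     * enabled) and compare the reported lock throughput.
     * Build: cc -O2 -pthread memcpy_lock_bench.c -o memcpy_lock_bench */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define NWORKERS   4          /* pretend request threads            */
    #define COPY_SIZE  2048       /* ~2k, like the stats snapshot test  */
    #define ITERATIONS 2000000L   /* lock/unlock pairs per worker       */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static char src[COPY_SIZE], dst[COPY_SIZE];
    static int do_copy = 0;

    /* Request threads: take the lock, do (almost) nothing, drop it.  Any
     * time the scanner holds the lock for its copy, these stall behind it. */
    static void *worker(void *arg) {
        (void)arg;
        for (long i = 0; i < ITERATIONS; i++) {
            pthread_mutex_lock(&lock);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Scanner: roughly 5 copies/sec under the lock, like the test described
     * above.  Runs until the process exits. */
    static void *scanner(void *arg) {
        (void)arg;
        for (;;) {
            if (do_copy) {
                pthread_mutex_lock(&lock);
                memcpy(dst, src, COPY_SIZE);
                pthread_mutex_unlock(&lock);
            }
            struct timespec ts = { 0, 200000000 };  /* 200ms between copies */
            nanosleep(&ts, NULL);
        }
        return NULL;
    }

    int main(int argc, char **argv) {
        (void)argv;
        do_copy = (argc > 1);
        pthread_t w[NWORKERS], s;
        struct timespec a, b;

        pthread_create(&s, NULL, scanner, NULL);
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&w[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(w[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);

        double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        printf("%s: %.0f lock acquisitions/sec\n",
               do_copy ? "with copy" : "no copy",
               (double)NWORKERS * ITERATIONS / secs);
        return 0;   /* exiting main tears down the scanner thread too */
    }

The point isn't the exact numbers; it's that you can measure this kind of
thing in an afternoon instead of guessing.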
> Generally you think i should move the cleanup into storage engine? How
> advanced is that (production ready?)
>
> > The worst we do is in slab rebalance, which holds a slab logically
> > and glances at it with tiny locks.
>
> The good thing about cleanup is that you won't have to use tiny locks
> (i think). Just lock the slab, copy memory and then wake up some
> thread to take a look, add the keys to some list then just process the
> list from time to time (or am i wrong?)
>
> Can you give me some pointers please?
>
> for now im seeing you're using:
> it = heads[slabs_clsid];
> then iterate it = it->next;
>
> that's probably why you say it's too slow... but what if we just
> lock=>copy one slab's memory=>unlock=>analyze slab=>[make 100 get
> requsts=>sleep]repeat? We have fixed size items in slab so we know
> exactly where the key and expiration time is, right?

I tried to explain the method I'd been thinking of for doing this most
efficiently, but you seem to be ignoring that. There's just no way in hell
we'll ever ship something that issues requests against itself, or forks, or
copies memory around in order to scan it. Here are some reasons, and then
even more alternatives (since your request rate is really low):

1) The most common use case has a mix of reads and writes, not heavy writes
followed by batch reads (which is what you're doing). That means common keys
with a 5 second expiration would get fetched, and thereby expired, naturally;
everything else would fall through the bottom due to disuse.

2) Tossing huge chunks of memory around and then issuing mass fetches back
against yourself doesn't test well. Issuing more locks doesn't test well
either (especially on NUMA: contesting locks or copying memory around causes
cacheline flushes, pipeline stalls, cross-CPU memory barriers, etc). I've
tested this: if I can't even copy 2k without impacting performance, copying
1mb of memory is not going to be fast enough for us.

3) Issuing extraneous micro locks or scanning does terrible things to large
instances, for the above reasons. If your traffic pattern *isn't* your
particular corner case, everything else gets slower.

You could also ship a copy of all your short-expiration SETs to syslog, and
have a daemon tailing the syslog and issuing gets as things expire (a rough
sketch follows below). Then you don't need to block the daemon at all,
though you're still issuing all those extra gets.

But, again, if you're really attached to doing it your way, go ahead and use
the engine-pu branch. In a few months memcached will do this better anyway,
and I don't agree with the method you're insisting on. I see too many
alternatives that have the potential to work far better.
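Here's that rough sketch of the syslog-tailing alternative. Everything about
it is an assumption, not something we ship: it presumes your app also logs
each short-TTL set as a line like "<unix_expiry> <key>", that those lines
arrive roughly in expiry order, that you feed them in with something like
"tail -F setlog | ./expire_tailer", and that the server is on
127.0.0.1:11211. The daemon just sleeps until each key's expiry has passed
and then issues a plain ASCII get, which is enough for the server to notice
the item is dead and reclaim it.

    /* expire_tailer.c -- hypothetical companion daemon, not part of
     * memcached.  Reads "<unix_expiry_time> <key>" lines from stdin (e.g.
     * piped in with "tail -F setlog | ./expire_tailer"), sleeps until each
     * expiry has passed, then issues a plain ASCII get so the server notices
     * the item is expired and reclaims it.  Assumes the log is roughly in
     * expiry order.  Build: cc -O2 expire_tailer.c -o expire_tailer */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    static int connect_memcached(const char *host, int port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons((unsigned short)port);
        inet_pton(AF_INET, host, &sa.sin_addr);
        if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
            perror("connect");
            exit(1);
        }
        return fd;
    }

    /* Naive response reader: slurp until "END\r\n" shows up.  Good enough
     * for a sketch where the usual answer is a bare miss on an
     * already-expired key. */
    static void drain_response(int fd) {
        char buf[8192];
        size_t used = 0;
        while (used < sizeof(buf) - 1) {
            ssize_t n = read(fd, buf + used, sizeof(buf) - 1 - used);
            if (n <= 0) exit(1);
            used += (size_t)n;
            buf[used] = '\0';
            if (strstr(buf, "END\r\n") != NULL) return;
        }
    }

    int main(void) {
        int fd = connect_memcached("127.0.0.1", 11211);  /* assumed address */
        char line[1024], key[256], cmd[300];
        long expiry;

        while (fgets(line, sizeof(line), stdin) != NULL) {
            if (sscanf(line, "%ld %250s", &expiry, key) != 2)
                continue;                         /* skip malformed lines */
            while (time(NULL) <= (time_t)expiry)  /* coarse wait is fine  */
                sleep(1);
            int len = snprintf(cmd, sizeof(cmd), "get %s\r\n", key);
            if (write(fd, cmd, (size_t)len) != len)
                exit(1);
            drain_response(fd);                   /* the get reaps the item */
        }
        return 0;
    }

You still pay for all those extra gets, but nothing here ever blocks the
daemon or adds locks inside it, which is the whole point.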
