Thanks so much for sticking around and testing! I have a number of bugs to go over, as I mentioned before, so it may take a little longer to bake this into a release. I still want to add a cap on how much churn the tail walk allows, so with 10,000 reflocked items you'd see a handful of OOM errors instead of shuffling all of them on a single allocation. That's mostly there to deal with extreme cases.
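To make the cap idea concrete, here's a rough sketch of the shape I have in mind. The names and structs (lru_class, find_eviction_candidate, MAX_REFLOCK_MOVES) are all made up for illustration and don't match the actual memcached or refchuck code; it just shows bouncing busy tail items back to the head, counting them in a "reflocked" stat, and giving up with an OOM once the per-allocation move budget is spent:

    #include <stddef.h>
    #include <stdint.h>

    /* Toy LRU item; field names are illustrative and don't match memcached's
     * real item struct. prev points toward the head, next toward the tail. */
    typedef struct lru_item {
        struct lru_item *prev;
        struct lru_item *next;
        unsigned short refcount;  /* toy semantics: 0 means nobody is using it
                                   * (real memcached items also hold a base
                                   * reference, so "busy" there is refcount > 1) */
    } lru_item;

    typedef struct {
        lru_item *head;
        lru_item *tail;
        uint64_t reflocked;    /* busy items bounced from the tail to the head */
        uint64_t outofmemory;  /* allocations we gave up on */
    } lru_class;

    /* Cap on how many busy (refcounted) tail items one allocation may shuffle
     * before giving up; 5 is just the number floated in this thread. */
    #define MAX_REFLOCK_MOVES 5

    static void lru_unlink(lru_class *c, lru_item *it) {
        if (it->prev) it->prev->next = it->next;
        if (it->next) it->next->prev = it->prev;
        if (c->head == it) c->head = it->next;
        if (c->tail == it) c->tail = it->prev;
        it->prev = it->next = NULL;
    }

    static void lru_link_head(lru_class *c, lru_item *it) {
        it->prev = NULL;
        it->next = c->head;
        if (c->head) c->head->prev = it;
        c->head = it;
        if (c->tail == NULL) c->tail = it;
    }

    /* Walk from the tail looking for something evictable. Busy items get
     * moved to the head (and counted as "reflocked") instead of blocking the
     * tail forever; once the move budget is spent we return NULL, which the
     * caller treats as an out-of-memory error. */
    lru_item *find_eviction_candidate(lru_class *c) {
        int moves = 0;
        lru_item *search = c->tail;
        while (search != NULL) {
            /* Grab the neighbor first: unlinking wipes search->prev. */
            lru_item *prev = search->prev;
            if (search->refcount == 0)
                return search;
            lru_unlink(c, search);
            lru_link_head(c, search);
            c->reflocked++;
            if (++moves >= MAX_REFLOCK_MOVES)
                break;
            search = prev;
        }
        c->outofmemory++;
        return NULL;
    }

The one wrinkle that already bit me in the real code is that same prev pointer: unlink clears it, so you have to save it before relinking or the loop bails early.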
Again, thanks. It's been really hard to get people to stick around for this; first we had to fix the crash caused by items sitting in the LRU, then it became apparent why they were there and we could fix that issue. I'm happy we finally understand this.

On Tue, 26 Aug 2014, Jay Grizzard wrote:

> Okay, so, we did some testing!
>
> I deployed a test build last Thursday and let it run with no further
> changes, graphing the 'reflocked' counter (which is the metric I added for
> 'refcounted so moved to other end of LRU'). The graph for that ends up
> looking like this: http://i.imgur.com/0CZfHWf.png
>
> Basically a spike on restart (which makes sense, there's probably a few
> fast-expiring or deleted entries on the tail almost immediately), and then
> occasional spikes over time. More spikes than I actually *expected*, but
> none are particularly large, and I'd completely believe that we had
> 'legitimate' locking of items in there, too. So I consider that completely
> helpful.
>
> (The graph is total across all slabs, and peaks at 8/sec, and only briefly,
> so... yeah, healthy.)
>
> The other thing I did yesterday was to intentionally lock a bunch of items
> to see what the behavior looked like. I picked a slab that was relatively
> high churn (max item age ~6000) and had no reflocks at all. Created 10k
> items and locked them. The reflocked graph for that looks like this:
> http://i.imgur.com/oghSU3o.png
>
> Basically, one big spike every couple of hours (with the interval
> decreasing as traffic increases). You can't see it from the graph, but the
> reflocked counter increments by exactly 10,000 for each spike, while the
> outofmemory counter stays at zero. This is exactly what I expected to
> happen, which is awesome.
>
> We've otherwise been really stable with the patch, so I think I'm fairly
> comfortable saying the patch you provided is a reasonable solution to the
> problem. I'd even be satisfied without adding anything else to limit the
> number of moves to 5 in a go, since the odds of that being an issue in just
> about any situation seem... low. But if you can add it cleanly, go for
> it! :)
>
> Let me know when you have a final patch (which would presumably be a
> release candidate for 1.4.21) and I'll be happy to verify that as well, and
> then we can officially declare this bug dead and have a little party, since
> I totally think finally finding this thing is deserving of a party... ;)
>
> -j
>
> On Thu, Aug 21, 2014 at 12:33 PM, dormando <[email protected]> wrote:
> > Okay cool.
> >
> > As I mentioned with the original link I will be adding some sort of
> > sanity checking to break the loop. I just have to reorganize the whole
> > thing and ran out of time (I got stuck for a while because unlink was
> > wiping search->prev and it kept bailing the loop :P)
> >
> > I need someone to try it to see if it's the right approach first, then
> > the rest is doable. It's just tricky code and requires some care.
> >
> > Thanks for putting some effort into this. I really appreciate it!
> >
> > On Thu, 21 Aug 2014, Jay Grizzard wrote:
> > > Hi, sorry about the slow response. Naturally, the daily problem we were
> > > having stopped as soon as you checked in that patch. Typical, eh?
> > >
> > > Anyhow, I've studied the patch and it seems to be pretty good; the only
> > > worry I have is that if you end up with the extremely degenerate case
> > > of an entire LRU being refcounted, you have to walk the entire LRU
> > > before returning 'out of memory'.
> > > I'm not thinking that this is a big problem (because if you have a few
> > > tens of thousands of items, that's pretty quick... and if you have
> > > millions... well, why do you have millions of items refcounted?), but
> > > worth at least noting.
> > >
> > > I was going to suggest a change to make it fit into the 'tries' loop
> > > better so those moves got counted as a try, but there doesn't seem to
> > > be a particularly clean way to do that, so I'm willing to just accept
> > > it as a limitation that might get hit in situations far worse than the
> > > one that's causing us issues right now. I'm okay with that.
> > >
> > > I haven't tried the patch under production load yet, because I wanted
> > > to have stats to give us some information about what was going on under
> > > the hood. I finally got a chance to add in an additional stat for
> > > refcounted items on the tail; I sent you a PR with that patch
> > > (https://github.com/dormando/memcached/pull/1). I *think* I got the
> > > right things in the right places, though you may take issue with the
> > > stat name ("reflocked").
> > >
> > > Now that I have the stats, I'm going to work on putting a patched copy
> > > out under production load to make sure it holds up there, and at least
> > > see about artificially generating one of the hung-get situations that
> > > was causing us problems. I'll let you know how that works out!
> > >
> > > -j
> > >
> > > > > On Mon, Aug 11, 2014 at 8:54 PM, dormando <[email protected]> wrote:
> > > > > > Well, sounds like whatever process was asking for that data is
> > > > > > dead (and possibly pissing off a customer) so you should indeed
> > > > > > figure out what that's about.
> > > > >
> > > > > Yeah, we'll definitely hunt this one down. I'll have to toss up a
> > > > > monitor to look for things in a write state for extended periods
> > > > > and then go do some tracing (rather than, say, waiting for it to
> > > > > actually break again). We *do* have some legitimately long-running
> > > > > (multi-hour) things going on, so can't just say "long connection
> > > > > bad!", but it would be nice if maybe those processes could slurp
> > > > > their entire response upfront or some such.
> > > > >
> > > > > > I think another thing we can do is actually throw a
> > > > > > refcounted-for-a-long-time item back to the front of the LRU.
> > > > > > I'll try a patch for that this weekend. It should have no real
> > > > > > overhead compared to other approaches of timing out connections.
> > > > >
> > > > > Is there any reason you can't do "if refcount > 1 when walking the
> > > > > end of the tail, send to the front" without requiring 'refcounted
> > > > > for a long time' (with, of course, still limiting it to 5ish
> > > > > actions)? It seems like this would be pretty safe, since generally
> > > > > stuff at the end of LRU shouldn't have a refcount, and then you
> > > > > don't need extra code for figuring out how long something has been
> > > > > refcounted.
> > > > >
> > > > > I guess there's a slightly degenerate case in there, which is that
> > > > > if you have a small slab that's 100% refcounted, you end up cycling
> > > > > a bunch of pointers every write just to run the LRU in a big circle
> > > > > and never write anything (similar to the case you suggest in your
> > > > > last paragraph), but that's a situation I'm totally willing to
> > > > > accept. ;)
> > > > >
> > > > > Anyhow, looking forward to a patch, and will gladly help test!
> > > >
> > > > Here, try out this branch:
> > > > https://github.com/dormando/memcached/tree/refchuck
> > > >
> > > > It needs some cleanup and sanity checking.
> > > > I want to redo the loop instead of the weird goto, add an arg to
> > > > item_update instead of copypasta, and add one or two sanity checks to
> > > > break the loop if you're trying to alloc out of a class that's 100%
> > > > reflocked.
> > > >
> > > > I added a test that works okay. Fails before, runs after. Can you try
> > > > this on one or two machines and see what the impact is?
> > > >
> > > > If it works okay I'll clean it up and merge. Need to spend a little
> > > > more time on the PR queue before I can cut a release though.
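Coming back to the "add an arg to item_update instead of copypasta" bit quoted above: whatever the extra argument ends up controlling in the real code (skipping the update rate limit, skipping a lock, etc.), the point is one function with a flag rather than a second copy-pasted relink inside the tail walk. A minimal sketch of that shape, with made-up names (toy_item_update, force, UPDATE_INTERVAL) that don't match memcached's actual item_update:

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t rel_time_t;

    /* Toy types again, just enough to show the refactor; none of this
     * matches memcached's real structs. */
    typedef struct toy_item {
        struct toy_item *prev;   /* toward the head */
        struct toy_item *next;   /* toward the tail */
        rel_time_t last_access;
    } toy_item;

    typedef struct {
        toy_item *head;
        toy_item *tail;
        rel_time_t now;          /* stand-in for the server clock */
    } toy_lru;

    /* Ordinary access-driven bumps are rate limited so hot keys don't thrash
     * the list; the interval here is arbitrary. */
    #define UPDATE_INTERVAL 60

    static void toy_unlink(toy_lru *l, toy_item *it) {
        if (it->prev) it->prev->next = it->next;
        if (it->next) it->next->prev = it->prev;
        if (l->head == it) l->head = it->next;
        if (l->tail == it) l->tail = it->prev;
        it->prev = it->next = NULL;
    }

    static void toy_link_head(toy_lru *l, toy_item *it) {
        it->prev = NULL;
        it->next = l->head;
        if (l->head) l->head->prev = it;
        l->head = it;
        if (l->tail == NULL) l->tail = it;
    }

    /* One bump function with a 'force' flag instead of two copy-pasted
     * bodies: the GET path calls it with force = 0 and keeps the rate limit,
     * while the tail walk calls it with force = 1 because a busy tail item
     * has to move no matter how recently it was touched. */
    static void toy_item_update(toy_lru *l, toy_item *it, int force) {
        if (!force && (l->now - it->last_access) < UPDATE_INTERVAL)
            return;  /* bumped recently enough; skip the relink */
        toy_unlink(l, it);
        toy_link_head(l, it);
        it->last_access = l->now;
    }

Folding the difference into one argument keeps the list surgery in exactly one place, which should also make the loop rewrite less error-prone.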
