Okay cool. As I mentioned with the original link, I will be adding some sort of sanity checking to break the loop. I just have to reorganize the whole thing and I ran out of time (I got stuck for a while because unlink was wiping search->prev and it kept bailing out of the loop :P)
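For reference, here's roughly the shape of the walk I'm talking about -- a standalone sketch, not the actual refchuck code (the struct and helper names are made up for illustration). The part that bit me is grabbing search->prev *before* relinking, since moving the item to the head rewrites the pointer the walk needs:

#include <stddef.h>

/* Sketch only: minimal stand-ins for the real item/LRU structures. */
typedef struct _item {
    struct _item *next;      /* toward the tail (older) */
    struct _item *prev;      /* toward the head (newer) */
    unsigned short refcount;
} item;

typedef struct {
    item *head;              /* newest */
    item *tail;              /* oldest */
} lru;

/* Pull an item out of the list and relink it at the head. */
static void move_to_head(lru *q, item *it) {
    if (it->prev) it->prev->next = it->next;
    if (it->next) it->next->prev = it->prev;
    if (q->tail == it) q->tail = it->prev;
    if (q->head == it) q->head = it->next;
    it->prev = NULL;
    it->next = q->head;
    if (q->head) q->head->prev = it;
    q->head = it;
    if (q->tail == NULL) q->tail = it;
}

/* Walk from the tail looking for something evictable. Refcounted items get
 * bumped back to the head; give up after max_moves so a tail that is
 * entirely reflocked can't spin the walk around the list forever. prev is
 * saved before relinking -- relinking clobbers search->prev, which is the
 * bug that kept bailing my loop. */
static item *find_evictable(lru *q, int max_moves) {
    int moves = 0;
    item *search = q->tail;
    while (search != NULL) {
        item *prev = search->prev;   /* save before we touch the links */
        if (search->refcount > 1) {
            if (++moves > max_moves)
                return NULL;         /* tail looks 100% reflocked; bail */
            move_to_head(q, search);
        } else {
            return search;           /* free to evict or reuse */
        }
        search = prev;
    }
    return NULL;
}

In the real code the walk sits inside the alloc path with the appropriate lock held, but the list surgery is the same idea.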
I need someone to try it to see if it's the right approach first, then the rest is doable. It's just tricky code and requires some care. Thanks for putting some effort into this. I really appreciate it!

On Thu, 21 Aug 2014, Jay Grizzard wrote:

> Hi, sorry about the slow response. Naturally, the daily problem we were having stopped as soon as you checked in that patch. Typical, eh?
>
> Anyhow, I've studied the patch and it seems to be pretty good -- the only worry I have is that if you end up with the extremely degenerate case of an entire LRU being refcounted, you have to walk the entire LRU before returning 'out of memory'. I'm not thinking that this is a big problem (because if you have a few tens of thousands of items, that's pretty quick... and if you have millions... well, why do you have millions of items refcounted?), but worth at least noting.
>
> I was going to suggest a change to make it fit into the 'tries' loop better so those moves got counted as a try, but there doesn't seem to be a particularly clean way to do that, so I'm willing to just accept it as a limitation that might get hit in situations far worse than the one that's causing us issues right now. I'm okay with that.
>
> I haven't tried the patch under production load yet, because I wanted to have stats to give us some information about what was going on under the hood. I finally got a chance to add in an additional stat for refcounted items on the tail -- I sent you a PR with that patch (https://github.com/dormando/memcached/pull/1). I *think* I got the right things in the right places, though you may take issue with the stat name ("reflocked").
>
> Now that I have the stats, I'm going to work on putting a patched copy out under production load to make sure it holds up there, and at least see about artificially generating one of the hung-get situations that was causing us problems. I'll let you know how that works out!
>
> -j
>
> On Mon, Aug 11, 2014 at 8:54 PM, dormando <[email protected]> wrote:
>
> > > > Well, sounds like whatever process was asking for that data is dead (and possibly pissing off a customer) so you should indeed figure out what that's about.
> > >
> > > Yeah, we'll definitely hunt this one down. I'll have to toss up a monitor to look for things in a write state for extended periods and then go do some tracing (rather than, say, waiting for it to actually break again). We *do* have some legitimately long-running (multi-hour) things going on, so can't just say "long connection bad!", but it would be nice if maybe those processes could slurp their entire response upfront or some such.
> > >
> > > > I think another thing we can do is actually throw a refcounted-for-a-long-time item back to the front of the LRU. I'll try a patch for that this weekend. It should have no real overhead compared to other approaches of timing out connections.
> > >
> > > Is there any reason you can't do "if refcount > 1 when walking the end of the tail, send to the front" without requiring 'refcounted for a long time' (with, of course, still limiting it to 5ish actions)? It seems like this would be pretty safe, since generally stuff at the end of LRU shouldn't have a refcount, and then you don't need extra code for figuring out how long something has been refcounted.
> > > I guess there's a slightly degenerate case in there, which is that if you have a small slab that's 100% refcounted, you end up cycling a bunch of pointers every write just to run the LRU in a big circle and never write anything (similar to the case you suggest in your last paragraph), but that's a situation I'm totally willing to accept. ;)
> > >
> > > Anyhow, looking forward to a patch, and will gladly help test!
> >
> > Here, try out this branch: https://github.com/dormando/memcached/tree/refchuck
> >
> > It needs some cleanup and sanity checking. I want to redo the loop instead of the weird goto, add an arg to item_update instead of copypasta, and add one or two sanity checks to break the loop if you're trying to alloc out of a class that's 100% reflocked.
> >
> > I added a test that works okay. Fails before, runs after. Can you try this on one or two machines and see what the impact is?
> >
> > If it works okay I'll clean it up and merge. Need to spend a little more time on the PR queue before I can cut though.
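Re the 'tries' question quoted above: the idea is for the bumps to eat into the same small budget, so even the 100%-reflocked case gives up quickly instead of walking the whole LRU. Continuing the made-up names from the sketch near the top of this mail (the actual branch code and the stat hookup in Jay's PR may look nothing like this):

typedef struct {
    lru queue;                       /* per slab class */
    unsigned long long reflocked;    /* times we gave up on a reflocked tail */
} class_lru;

#define TAIL_BUMP_LIMIT 5            /* the "5ish actions" from the thread */

/* Alloc-side view: try to reclaim from the tail; if everything we looked at
 * was still held by a connection, count it and report out of memory. */
static item *reclaim_from_tail(class_lru *cls) {
    item *it = find_evictable(&cls->queue, TAIL_BUMP_LIMIT);
    if (it == NULL) {
        cls->reflocked++;
        return NULL;                 /* caller treats this as out of memory */
    }
    return it;
}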
