Hi, sorry about the slow response. Naturally, the daily problem we were having stopped as soon as you checked in that patch. Typical, eh?
Anyhow, I’ve studied the patch and it seems to be pretty good. The only worry I have is that if you end up with the extremely degenerate case of an entire LRU being refcounted, you have to walk the entire LRU before returning ‘out of memory’. I don’t think this is a big problem (if you have a few tens of thousands of items, that’s pretty quick… and if you have millions, well, why do you have millions of items refcounted?), but it’s worth at least noting. I was going to suggest a change to make it fit into the ‘tries’ loop better, so those moves got counted as a try, but there doesn’t seem to be a particularly clean way to do that, so I’m willing to just accept it as a limitation that might get hit in situations far worse than the one that’s causing us issues right now. I’m okay with that.

I haven’t tried the patch under production load yet, because I wanted to have stats to give us *some* information about what was going on under the hood. I finally got a chance to add an additional stat for refcounted items on the tail, and sent you a PR with that patch (https://github.com/dormando/memcached/pull/1). I *think* I got the right things in the right places, though you may take issue with the stat name (“reflocked”). There’s also a rough sketch of how I picture the tail-walk behaving at the bottom of this mail.

Now that I have the stats, I’m going to work on putting a patched copy out under production load to make sure it holds up there, and at least see about artificially generating one of the hung-get situations that was causing us problems. I’ll let you know how that works out!

-j

On Mon, Aug 11, 2014 at 8:54 PM, dormando <[email protected]> wrote:

> > > Well, sounds like whatever process was asking for that data is dead (and possibly pissing off a customer) so you should indeed figure out what that's about.
> >
> > Yeah, we’ll definitely hunt this one down. I’ll have to toss up a monitor to look for things in a write state for extended periods and then go do some tracing (rather than, say, waiting for it to actually break again). We *do* have some legitimately long-running (multi-hour) things going on, so we can’t just say “long connection bad!”, but it would be nice if maybe those processes could slurp their entire response upfront or some such.
> >
> > > I think another thing we can do is actually throw a refcounted-for-a-long-time item back to the front of the LRU. I'll try a patch for that this weekend. It should have no real overhead compared to other approaches of timing out connections.
> >
> > Is there any reason you can’t do “if refcount > 1 when walking the end of the tail, send to the front” without requiring ‘refcounted for a long time’ (with, of course, still limiting it to 5ish actions)? It seems like this would be pretty safe, since generally stuff at the end of the LRU shouldn’t have a refcount, and then you don’t need extra code for figuring out how long something has been refcounted.
> >
> > I guess there’s a slightly degenerate case in there, which is that if you have a small slab that’s 100% refcounted, you end up cycling a bunch of pointers every write just to run the LRU in a big circle and never write anything (similar to the case you suggest in your last paragraph), but that’s a situation I’m totally willing to accept. ;)
> >
> > Anyhow, looking forward to a patch, and will gladly help test!
>
> Here, try out this branch:
> https://github.com/dormando/memcached/tree/refchuck
>
> It needs some cleanup and sanity checking.
> I want to redo the loop instead of the weird goto, add an arg to item_update instead of copypasta, and add one or two sanity checks to break the loop if you're trying to alloc out of a class that's 100% reflocked.
>
> I added a test that works okay. Fails before, runs after. Can you try this on one or two machines and see what the impact is?
>
> If it works okay I'll clean it up and merge. Need to spend a little more time on the PR queue before I can cut though.
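
P.S. In case a concrete picture helps, here’s roughly how I imagine the tail-walk behaving. This is just a standalone sketch, not your actual branch; every name in it (lru_item, lru_head, lru_tail, lru_unlink, lru_push_head, evict_one_from_tail, stats_reflocked) is made up for illustration, and the real code obviously has to deal with slab classes, locking, and the existing ‘tries’ loop.

/* Rough sketch of the tail-walk idea; not the real memcached code.
 * All names here are invented for illustration only. */

#include <stddef.h>

typedef struct lru_item {
    struct lru_item *prev;   /* toward the head (more recently used) */
    struct lru_item *next;   /* toward the tail (least recently used) */
    unsigned short refcount; /* > 1 means some connection still holds it */
} lru_item;

static lru_item *lru_head;            /* most recently used */
static lru_item *lru_tail;            /* least recently used */
static unsigned long stats_reflocked; /* refcounted items seen at the tail */

/* Unlink an item from the doubly linked LRU list. */
static void lru_unlink(lru_item *it) {
    if (it->prev) it->prev->next = it->next; else lru_head = it->next;
    if (it->next) it->next->prev = it->prev; else lru_tail = it->prev;
    it->prev = it->next = NULL;
}

/* Push an item onto the head of the LRU. */
static void lru_push_head(lru_item *it) {
    it->prev = NULL;
    it->next = lru_head;
    if (lru_head) lru_head->prev = it;
    lru_head = it;
    if (lru_tail == NULL) lru_tail = it;
}

/* Try to free one item by walking from the tail.  Refcounted items get
 * bumped back to the head (and counted as "reflocked"); the first
 * unreferenced item is unlinked and returned for reuse.  After `tries`
 * attempts we give up, so a class that is 100% reflocked returns NULL
 * ("out of memory") instead of looping forever. */
static lru_item *evict_one_from_tail(int tries) {
    for (int i = 0; i < tries; i++) {
        lru_item *it = lru_tail;
        if (it == NULL)
            return NULL;            /* nothing in this LRU at all */
        if (it->refcount > 1) {
            stats_reflocked++;      /* the new stat from the PR */
            lru_unlink(it);
            lru_push_head(it);      /* send it to the front and retry */
            continue;
        }
        lru_unlink(it);             /* unreferenced: safe to reclaim */
        return it;
    }
    return NULL;                    /* every try was reflocked */
}

Bounding the loop at a handful of tries is exactly the degenerate case I mentioned above: if everything we look at is reflocked we eventually give up and return ‘out of memory’ rather than spinning the list in a circle.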
