Apparently I lied about the weekend, sorry...

On Mon, 11 Aug 2014, Jay Grizzard wrote:
> > Well, sounds like whatever process was asking for that data is dead (and
> > possibly pissing off a customer) so you should indeed figure out what
> > that's about.
>
> Yeah, we’ll definitely hunt this one down. I’ll have to toss up a monitor
> to look for things in a write state for extended periods and then go do
> some tracing (rather than, say, waiting for it to actually break again).
> We *do* have some legitimately long-running (multi-hour) things going on,
> so can’t just say “long connection bad!”, but it would be nice if maybe
> those processes could slurp their entire response upfront or some such.

Good luck!

> > I think another thing we can do is actually throw a
> > refcounted-for-a-long-time item back to the front of the LRU. I'll try
> > a patch for that this weekend. It should have no real overhead compared
> > to other approaches of timing out connections.
>
> Is there any reason you can’t do “if refcount > 1 when walking the end of
> the tail, send to the front” without requiring ‘refcounted for a long
> time’ (with, of course, still limiting it to 5ish actions)? It seems like
> this would be pretty safe, since generally stuff at the end of LRU
> shouldn’t have a refcount, and then you don’t need extra code for
> figuring out how long something has been refcounted.
>
> I guess there’s a slightly degenerate case in there, which is that if you
> have a small slab that’s 100% refcounted, you end up cycling a bunch of
> pointers every write just to run the LRU in a big circle and never write
> anything (similar to the case you suggest in your last paragraph), but
> that’s a situation I’m totally willing to accept. ;)
>
> Anyhow, looking forward to a patch, and will gladly help test!
>
> Thanks!

I'm going back and forth on it, honestly. I think it should only move an
item if it's been at least UPDATE_INTERVAL since it last moved it, possibly
UPDATE_INTERVAL * 4. Given your case of "I have a bajillion objects ref'ed
by this one connection", and the fact that the allocator only walks five
items up from the tail before giving up, I have two main options:

1) Throw the bottom 5 to the top, then give up (and do that for each
allocation forever, which can slow down all writes by holding the central
cache lock for longer). That'll still cause a number of OOMs while it tries
to clear your 9,000,000 ref'ed objects from the bottom (yeah, I know it's
only 3200ish).

2) If refcounted && last_update + UPDATE_INTERVAL*N < now -> flip to top
and don't count that as a try. This will cause memcached to have a very
brief hiccup when it lands on the pile of objects, but won't cause an OOM
and won't flip around forever. It also avoids a pathological regression if
someone hammers a slab class stuck in this state (which is what you'd get
with path #1).

If you have teeny slab classes you're likely to be screwed either way, so
the extra time interval doesn't hurt you much more than you would anyway.
I assume/hope objects that you've been fetching take more than a couple of
minutes to hit the bottom of the slab class. If they don't, your evictions
are probably nutters and your hit rate crap anyway; you'd need more RAM.

So yeah, leaning toward #2? It's a different definition of "refcounted for
a long time" compared to what tail_repairs defaulted to. Much shorter.
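For concreteness, here's a rough sketch of the shape of #2. This is *not*
the actual memcached code: the names (time_last_moved, lru_bump_to_head,
UPDATE_FACTOR) and the linked-list plumbing are made up for illustration,
and it assumes the caller already holds the cache lock:

#include <stddef.h>
#include <time.h>

#define UPDATE_INTERVAL 60   /* assumed bump interval, in seconds */
#define UPDATE_FACTOR   4    /* the "N" in UPDATE_INTERVAL*N */
#define MAX_TRIES       5    /* allocator walks five up from the tail */

typedef struct _item {
    struct _item *next;
    struct _item *prev;
    unsigned short refcount;
    time_t time_last_moved;  /* hypothetical: last time we bumped it */
} item;

/* Unlink an item from wherever it sits and relink it at the LRU head. */
static void lru_bump_to_head(item **head, item **tail, item *it) {
    if (it == *head)
        return;
    if (it->prev)
        it->prev->next = it->next;
    if (it->next)
        it->next->prev = it->prev;
    if (*tail == it)
        *tail = it->prev;
    it->prev = NULL;
    it->next = *head;
    if (*head)
        (*head)->prev = it;
    *head = it;
    if (*tail == NULL)
        *tail = it;
}

/* Walk up from the tail looking for a reusable item. Long-refcounted
 * items get flipped to the head without spending a "try", so a pile of
 * ref'ed objects costs one pass instead of an OOM on every alloc. */
static item *pull_from_tail(item **head, item **tail, time_t now) {
    int tries = MAX_TRIES;
    item *it = *tail;

    while (it != NULL && tries > 0) {
        item *prev = it->prev;   /* grab before we move anything */
        if (it->refcount > 0) {
            if (it->time_last_moved + UPDATE_INTERVAL * UPDATE_FACTOR < now) {
                /* Refcounted "for a long time": flip to the top and
                 * don't count it against our tries. */
                lru_bump_to_head(head, tail, it);
                it->time_last_moved = now;
            } else {
                tries--;         /* recently moved; skip it normally */
            }
        } else {
            return it;           /* unreferenced: safe to evict/reuse */
        }
        it = prev;
    }
    return NULL;                 /* caller falls through to OOM handling */
}

The important part is that the flip doesn't decrement tries, so the worst
case is one pass over the stuck pile (the brief hiccup) rather than five
futile bumps and an OOM on every single allocation.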
