On Fri, 8 Aug 2014, Jay Grizzard wrote:
> Okay, I'm pretty sure I understand what's going on here now.
>
> This is what I think the sequence of events is:
>
> - Client does gets for a very large number of keys. I'm not sure how to
>   actually see the request in the core (if that data is even still
>   attached), but isize ("list of items to write out") is 3200. I'm
>   assuming that's the list of items pending for write, anyhow.
> - All the items to be written get refcount++ and queued for delivery.
>   Some of these items are on the tail (or get moved there at some point).
> - At some point during transmission, the client system either stops
>   processing, or starts processing *so* slowly it may as well have
>   stopped.
> - The connection sits there and stays healthy (since the client is still
>   online), but makes little/no progress, so the connection is essentially
>   permanently in a mwrite state, keeping all the items on the transmit
>   list permanently referenced.
> - In the meantime, the tail gets used as normal, and as actually free
>   entries get used, these referenced entries 'bubble up' until they
>   occupy the first 5ish slots.
> - Presto, that slab no longer accepts writes until something happens to
>   force a TCP disconnect (client process crashing), or the processing of
>   the response is actually completed.
Neat. That would explain it.
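For anyone following along, the reason only the first handful of tail items
matters is that the allocation path only inspects a few candidates before
giving up. Roughly like this (a condensed sketch, not the literal
do_item_alloc() code; the struct and helper names here are illustrative):

    /* Simplified model of the eviction search: walk a few items up from
     * the LRU tail and skip anything a connection still holds a
     * reference to. If every candidate is pinned, the store fails with
     * "out of memory" for that slab class. */
    typedef struct _item {
        struct _item  *prev;      /* towards the head of the LRU */
        unsigned short refcount;  /* 1 == only the LRU holds it */
    } item;

    static item *find_evictable(item *tail) {
        int tries = 5;                /* only ~5 candidates get checked */
        item *it;
        for (it = tail; tries > 0 && it != NULL; tries--, it = it->prev) {
            if (it->refcount > 1)
                continue;             /* pinned by a conn's ilist: skip */
            return it;                /* evictable: caller unlinks/reuses it */
        }
        return NULL;                  /* all pinned -> OOM for this slab */
    }

So twenty pinned items at the tail is far more than enough to make every
store to that class fail.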
>
> ...the specific connection I was looking at above was fd 2089, the one in
> the conn_mwrite state for 10917 seconds in the previously attached files.
>
> Walking the tail for slab 16 (the hung slab), the first 20 entries have
> refcount=2, before finally finding a refcount=1 at entry 21.
>
> If I take the first half dozen of those or so (that's all I tried), I can
> find every single one of them listed in the array at conns[2089]->ilist.
yuup.
> So unless I'm reading something horribly wrong (which I may be, as I'm
> only passingly familiar with memcached internals), that's why we're
> breaking.
Yeah.
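To spell out the pinning part: the get path bumps the refcount when it
parks the item pointer on the connection's ilist, and the matching
decrement only runs once the response bytes for that item have actually
been written out. Very roughly (an illustrative model, not the real
conn/item structs or function names):

    #include <stddef.h>

    /* Illustrative model of how a multi-get pins items on a
     * connection's write list. */
    typedef struct { unsigned short refcount; } item;

    typedef struct {
        item  **ilist;  /* items queued for this connection's response */
        size_t  ileft;  /* how many are still pending transmit */
    } conn;

    /* get path: the looked-up item is pinned and queued for writing */
    static void queue_item_for_write(conn *c, item *it) {
        it->refcount++;              /* eviction will now skip this item */
        c->ilist[c->ileft++] = it;
    }

    /* write-completion path: only runs once the bytes actually go out */
    static void release_written_items(conn *c) {
        size_t i;
        for (i = 0; i < c->ileft; i++)
            c->ilist[i]->refcount--; /* unpinned: evictable again */
        c->ileft = 0;
    }

If the peer never drains its socket, the release step never runs, which is
exactly the refcount=2 pile-up you're seeing in conns[2089]->ilist.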
> Now, how to *fix* this, I'm not sure about. Obviously the client should
> actually be processing the data it's being sent in a timely manner. And
> requesting thousands of keys may or may not be sane. But regardless of
> that, it still probably shouldn't break the server.
Well, sounds like whatever process was asking for that data is dead (and
possibly pissing off a customer) so you should indeed figure out what
that's about. Sounds like you know what you're doing, and can go track
down what's on the far side of that fd. Give it a good strace and see what
it's up to.
> An inactivity timer might help, as long as it's willing to kill
> connections that are still in a writing state. That wouldn't actually
> *fix* the problem, but would certainly decrease the odds of it happening
> to a point that it could be considered "fixed" for most practical
> purposes.
>
> What do you think?
Someone had sent a pull request to drop idle connections, but I asked them
to rewrite it on top of the conn tracking work since it would be a lot
more efficient running from a background thread. Otherwise you have to
make billions of timers or run lists of things and that can get a little
out of hand speed-wise. That was only meant to kill idle connections, I
think; killing something in-flight is more complicated.
I think the guy on the pull request disappeared, but it might be worth
finishing it, or finding some similar approach for long-running queries.
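The shape of it would be something like a single reaper thread sweeping
conn state on a timer, e.g. (a rough sketch with made-up names, not the
pull request's code; the real version would hang off the conn tracking
work):

    #include <pthread.h>
    #include <time.h>
    #include <unistd.h>

    #define MAX_CONNS    65536
    #define IDLE_TIMEOUT 600   /* seconds; would be a runtime option */

    /* Both of these are stand-ins for whatever the conn tracking work
     * ends up exposing. */
    extern time_t conn_last_activity[MAX_CONNS]; /* stamped on each r/w */
    extern void   close_idle_conn(int fd);       /* hand fd to its worker */

    static void *idle_reaper(void *arg) {
        (void)arg;
        for (;;) {
            time_t now = time(NULL);
            int fd;
            for (fd = 0; fd < MAX_CONNS; fd++) {
                if (conn_last_activity[fd] != 0 &&
                    now - conn_last_activity[fd] > IDLE_TIMEOUT)
                    close_idle_conn(fd);
            }
            sleep(5);  /* one sweep every few seconds is plenty */
        }
        return NULL;
    }

That's the "more efficient" part: one background thread and one pass over
the connections, instead of a timer per connection.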
I think another thing we can do is actually throw a
refcounted-for-a-long-time item back to the front of the LRU. I'll try a
patch for that this weekend. It should have no real overhead compared to
other approaches of timing out connections.
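The patch would basically amount to this inside the tail search: if a
candidate has been pinned for a long time, relink it at the head instead of
just skipping it, so stuck items stop crowding the tail. Sketch only
(made-up names and threshold; the real change would live in
do_item_alloc()'s search loop):

    #include <time.h>

    typedef struct _item {
        struct _item  *prev, *next;
        unsigned short refcount;
        time_t         last_touched;
    } item;

    #define LONG_REFCOUNT_SECS 60   /* arbitrary threshold for the example */

    /* Stand-ins for the real LRU link/unlink helpers. */
    extern void lru_unlink(item *it);
    extern void lru_link_head(item *it);

    /* Called on each tail candidate during the eviction search; returns
     * 1 if the item was bumped out of the way (keep searching the tail). */
    static int maybe_bump_pinned(item *it, time_t now) {
        if (it->refcount > 1 &&
            (now - it->last_touched) > LONG_REFCOUNT_SECS) {
            lru_unlink(it);
            lru_link_head(it);
            return 1;
        }
        return 0;
    }

That way a wedged connection only costs you the items it's actually
holding, instead of the whole slab class.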
It won't be 100% fixable, I think. If you have a slab with one page in it,
and that item is "active" in some way, you won't be able to upload new
data. You're just legitimately out of memory at that point.
> -j
>
>
> On Thu, Aug 7, 2014 at 5:17 PM, dormando <[email protected]> wrote:
> Thanks! It might take me a while to look into it more closely.
>
> That conn_mwrite is probably bad; however, a single connection shouldn't
> be able to do it. Before the OOM is given up, memcached walks up the
> chain from the bottom of the LRU by 5ish. So all of them would have to be
> locked, or possibly something is going on that I'm unaware of.
>
> Great that you have some cores. Can you look at the tail of the LRU for
> the slab which was OOM'ing, and print the item struct there? If possible,
> walk up 5-10 items back from the tail and print each (anonymized, of
> course). It'd be useful to see the refcount and flags on the items.
>
> Have you tried re-enabling tail repairs on one of your .20 instances? It
> could still crash sometimes, but you can set the timeout to a reasonably
> low number and see if that helps at all while we figure this out.
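(For reference: the knob there is tail_repair_time, which has been off by
default since 1.4.18. If memory serves, you re-enable it with something
like:)

    memcached -o tail_repair_time=600   # seconds before forcefully reclaiming a stuck tail item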
>
> On Thu, 7 Aug 2014, Jay Grizzard wrote:
>
> > (I work with Denis, who is out of town this week)
> >
> > So we finally got a more proper 1.4.20 deployment going, and we’ve seen
> > this issue quite a lot over the past week. When it happened this morning
> > I was able to grab what you requested.
> >
> > I’ve included a couple of “stats conn” dumps, with anonymized addresses,
> > taken four minutes apart. It looks like there’s one connection that could
> > possibly be hung:
> >
> > STAT 2089:state conn_mwrite
> >
> > …would that be enough to cause this problem? (I’m assuming the answer is
> > “it depends”) I snagged a core file from the process that I should be
> > able to muck through to answer questions if there’s somewhere in there we
> > would find useful information.
> >
> > Worth noting that while we’ve been able to reproduce the hang (a single
> > slab starts reporting oom for every write), we haven’t reproduced the
> > “but recovers on its own” part because these are production servers and
> > the problem actually causes real issues, so we restart them rather than
> > waiting several hours to see if the problem clears up.
> >
> > Also, reading up in the thread, it’s worth noting that lack of TCP
> > keepalives (which we actually have, memcached enables it) wouldn’t
> > actually affect the “and automatically recover” aspect of things, because
> > TCP keepalives only happen when a connection is completely idle. When
> > there’s pending data (which there would be on a hung write), standard TCP
> > timeouts (which are much faster) apply.
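(Side note on the keepalive point above: if you want the kernel to give up
faster on a peer that's sitting on unacknowledged data, Linux has a
TCP_USER_TIMEOUT socket option that bounds how long written-but-unacked
data may sit before the connection is dropped. memcached doesn't set it
today, so it would have to be patched in or set on the client side.
Illustrative only:)

    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Bound how long written-but-unacked data may sit before the kernel
     * drops the connection (Linux-specific, kernel 2.6.37+). */
    static int set_write_deadline(int fd, unsigned int timeout_ms)
    {
        return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                          &timeout_ms, sizeof(timeout_ms));
    }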
> >
> > (And yes, we do have lots of idle connections to our caches, but that’s
> > not something we can immediately fix, nor should it directly be the cause
> > of these issues.)
> >
> > Anyhow… thoughts?
> >
> > -j
> >