The command-line arguments used are:

-u memcached -m 236544 -c 64000 -p 11211 -t 32 -C -n 5 -f 1.05 -o
ext_path=/mnt/memcache:1700G,ext_path=/mnt1/memcache:1700G,ext_path=/mnt2/memcache:1700G,ext_path=/mnt3/memcache:1700G,ext_threads=32,ext_item_size=64
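
For completeness, one way to double-check what a running daemon actually
picked up (maxconns, thread counts, the ext_* values, etc.) is to ask it
over the text protocol.  This is just a sketch; the hostname below is a
placeholder for whichever test server you point it at:

  # Dump the effective runtime settings of a live memcached instance
  printf 'stats settings\r\nquit\r\n' | nc server-248 11211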

And we have some data, but frankly we didn't focus on the Memcache servers
until late in the process.  The initial errors suggested the problem was in
a very different part of the collection of technical components in our
larger service.  By the time we realized the "memcached" processes were not
accepting new connections, a fair amount of time had already passed and we
wanted to correct the behavior quickly.

First, sockets in use on the servers...

One system, call it server-248, showed TCP sockets hovering around 1900
after traffic ramped up for the day.  It held at that level from ~6:45am
until 10:06am.  We collect SAR data every 2 minutes, and at the next
reading, 10:08am, the TCP socket count had jumped to 63842.  Meaning it
didn't grow slowly over time, it jumped from 1937 to 63842 in a 2-minute
window.  The count stayed at 63842-63844 until 12:06pm, when we restarted
the "memcached" process.  After that the number dropped over time back to a
more typical level.

10:02am   1937
10:04am   1937
10:06am   1937
10:08am  63842
10:10am  63842
...etc...
12:04pm  63843
12:06pm  63844
12:08pm  18415
12:10pm  17202
12:12pm  16333
12:14pm  16197
12:16pm  16134
12:18pm  16099
12:20pm   1617
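
For reference, those counts come from the sysstat archives (the tcpsck
column); a roughly equivalent manual query, with the archive file name and
the time window as placeholders, would be:

  # Pull the per-2-minute socket counts for the window around the jump
  sar -n SOCK -f /var/log/sa/saNN -s 10:00:00 -e 12:30:00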

The other system that ran into trouble, which I'll call server-85,
exhibited similar behavior but started later.  Here's a sample of TCP
socket counts from that server.

11:30am   1805
11:32am   1801
11:34am   1830
11:36am   1817
11:38am  63905
11:40am  63905
...etc...
12:20pm  63908
12:22pm  63908
12:24pm   1708
12:26pm   1720
12:28pm   1747

There were other network-centric data points showing the systems grinding
to a halt in terms of accepting new connections, like the bandwidth in/out
of the NICs, etc.  But they all support the same idea: the "memcached"
servers stopped accepting new connections.

Second, details about the sockets...

During the incident, we did capture summary information on the socket
states from various servers on the network, and a full "netstat -an"
listing from one of the servers.  Both server-248 and server-85 showed tens
of thousands of sockets in a CLOSE_WAIT state, hundreds in a SYN_RCVD
state, and a small number of ESTABLISHED sockets.
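
For anyone who wants to reproduce the summary, the per-state counts can be
generated from a saved listing with something along these lines (the file
name is a placeholder; ss works just as well on newer systems):

  # Count sockets by TCP state from a saved "netstat -an" capture
  awk '/^tcp/ {print $6}' netstat-an.txt | sort | uniq -c | sort -rn

  # Or live, limited to the memcached port
  ss -tan 'sport = :11211' | awk 'NR>1 {print $1}' | sort | uniq -c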

There may still have been traffic on the ESTABLISHED connections to the
"memcached" servers, but if there was, it was a trivial amount.  Multiple
people were running "memkeys" at the time and reported seeing no activity.

Third, "stats" from the incapacitated "memcached" processes...

We do not have stats from either server-248 or server-85 during the time
they were in trouble.  In hindsight, that was a big oversight.
It's not clear that we could have gotten a connection to the server to pull
the stats, but I'd really like to know what those counters said!

I do have the results from "stats", "stats slabs", "stats items" and "stats
conns" from 17:10 the previous evening.  That output doesn't show any
obvious errors/problems slowly building up, waiting for some event to
trigger a massive failure.  But it's from ~15 hours before the servers got
into trouble, so I don't think it's all that helpful.
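
For reference, those snapshots can be pulled over the plain text protocol
with something like the following, pointed at whichever server is of
interest:

  # Collect the same stats snapshots from a running instance
  for cmd in stats 'stats slabs' 'stats items' 'stats conns'; do
      printf '%s\r\nquit\r\n' "$cmd" | nc server-248 11211
  done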

-jj

On Fri, Oct 12, 2018 at 6:24 PM dormando <[email protected]> wrote:

> Hey,
>
> Probably unrelated to the original thread. I'm highly interested in
> understanding what's going on here though, and would appreciate any help
> you can spare.
>
> What data did you collect? The more information you can share the better:
>
> 1) start arguments for the daemon
> 2) stats output (stats, stats items, stats slabs, and importantly "stats
> conns") - you will probably have to redact some of the output.
> 3) were you still able to connect to and run commands on those daemons
> while they had a mass of CLOSE_WAIT?
>
> Thanks,
> -Dormando
>
> On Fri, 12 Oct 2018, [email protected] wrote:
>
> > We're testing version 1.5.10 as a possible upgrade candidate for our
> > older memcached servers, using a pool of 9 servers.  They are running in
> > parallel with the production pool, also 9 servers.  For the test all read
> > requests are going to the production pool, and all updates (set, delete,
> > etc...) are sent to one server in the production pool and one server in
> > the 1.5.10 pool via the key hashing algorithm.
> >
> > That setup had been running without incident for about 12 days, then
> > yesterday two of the servers experienced the mass of CLOSE_WAIT
> > connections similar to what's been described here.  We were able to
> > collect some data, but not enough to figure out what's happening.  So I'm
> > hoping to kickstart a discussion here about how to diagnose what's going
> > on.  Until we can find a way to explain (and prevent) another problem
> > like this, we're unable to upgrade.
> >
> > I can provide more information about our configuration.  I'm just not
> > sure what bits are useful/interesting.  I will note that we're using
> > "extstore" functionality on the new servers.
> >
> > -jj
> >
>
