Re: memcached-1.4.20 stuck when "Too many open connections"

dormando Fri, 31 Oct 2014 00:02:03 -0700

Hey,

How are you reproducing this? How many connections do you typically have
open?


It's really bizarre that your curr_conns is 5 while accepting new connections
is disabled. Even if there's still a race, each connection that closes has
an opportunity to flip the acceptor back on.
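
As a rough illustration of that mechanism, here is a toy model in C. It is
not memcached's source: the struct and the three function names are
hypothetical, and it assumes a periodic timer re-checks the flag (which is
what the 10 ms epoll_wait timeouts in the quoted strace suggest). Only the
flag name allow_new_conns comes from the report below.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; names are illustrative, not memcached's real code. */
struct acceptor {
    bool allow_new_conns; /* flag inspected with gdb in the quoted report */
    bool listening;       /* whether the listen socket is being polled */
    int  curr_conns;      /* currently open client connections */
};

/* accept() failed with EMFILE: stop listening until a connection closes. */
static void on_accept_emfile(struct acceptor *a) {
    a->allow_new_conns = false;
    a->listening = false;
}

/* A client connection closed, so a descriptor is free again: raise the flag. */
static void on_conn_close(struct acceptor *a) {
    assert(a->curr_conns > 0);
    a->curr_conns--;
    a->allow_new_conns = true;
}

/* Periodic timer tick. Returns true if the listener was re-enabled. */
static bool on_timer_tick(struct acceptor *a) {
    if (!a->listening && a->allow_new_conns) {
        a->listening = true;
        return true;
    }
    return false;
}
```

In this model the listener only comes back after some connection closes. If
nothing closes after the EMFILE, every tick is a no-op, which matches the
idle epoll_wait loop in the strace.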

Can you print what "stats settings" shows? If it's adjusting your actual
maxconns downward it should show there...
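
If it's easier to check this from a script than by eye, a small parser along
these lines could pull the value out of the response. This is only a sketch:
stat_lookup_long is a hypothetical helper, and it assumes the response uses
the usual "STAT <name> <value>" line format, with maxconns as one of the
names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find "STAT <name> <value>" in a stats response and
 * store the integer value in *out. Returns 0 on success, -1 if not found. */
static int stat_lookup_long(const char *response, const char *name, long *out) {
    assert(response && name && out);
    size_t name_len = strlen(name);
    const char *line = response;

    while (line && *line) {
        /* Match "STAT ", then the exact name, then a separating space. */
        if (strncmp(line, "STAT ", 5) == 0 &&
            strncmp(line + 5, name, name_len) == 0 &&
            line[5 + name_len] == ' ') {
            return sscanf(line + 5 + name_len, "%ld", out) == 1 ? 0 : -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Comparing the value it returns for "maxconns" against the -c argument on the
command line would show whether the limit was lowered.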

On Wed, 29 Oct 2014, Samdy Sun wrote:

> There are no deadlocks:
> (gdb) info thread
> * 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48)
>     at memcached.c:4044
>   4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in __kernel_vsyscall ()
>   3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in __kernel_vsyscall ()
>   2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in __kernel_vsyscall ()
>   1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in __kernel_vsyscall ()
> (gdb) t 1
> [Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0  0x007ad430 in __kernel_vsyscall ()
> (gdb) bt
> #0  0x007ad430 in __kernel_vsyscall ()
> #1  0x005c5366 in epoll_wait () from /lib/libc.so.6
> #2  0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198
> #3  0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538
> #4  0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
> (gdb) 
>
> (gdb) t 2
> [Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0  0x007ad430 in __kernel_vsyscall ()
> (gdb) bt
> #0  0x007ad430 in __kernel_vsyscall ()
> #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
> #2  0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
> #3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
> #4  0x005c4aee in clone () from /lib/libc.so.6
> (gdb) t 3
> [Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0  0x007ad430 in __kernel_vsyscall ()
> (gdb) bt
> #0  0x007ad430 in __kernel_vsyscall ()
> #1  0x005838b6 in nanosleep () from /lib/libc.so.6
> #2  0x005836e0 in sleep () from /lib/libc.so.6
> #3  0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
> #4  0x00a61a49 in start_thread () from /lib/libpthread.so.0
> #5  0x005c4aee in clone () from /lib/libc.so.6
> (gdb) t 4
> [Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0  0x007ad430 in __kernel_vsyscall ()
> (gdb) bt
> #0  0x007ad430 in __kernel_vsyscall ()
> #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
> #2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
> #3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
> #4  0x005c4aee in clone () from /lib/libc.so.6
> (gdb) t 5
> [Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0  0x007ad430 in __kernel_vsyscall ()
> (gdb) bt
> #0  0x007ad430 in __kernel_vsyscall ()
> #1  0x00a68998 in sendmsg () from /lib/libpthread.so.0
> #2  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044
> #3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370
> #4  event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441
> #5  0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at event.c:395
> #6  event_base_loop (base=0x9310658, flags=0) at event.c:547
> #7  0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471
> #8  0x00a61a49 in start_thread () from /lib/libpthread.so.0
> #9  0x005c4aee in clone () from /lib/libc.so.6
> (gdb) 
>
> strace output follows. Is maxconnsevent the only event registered on epoll?
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0
> epoll_wait(4, {}, 32, 10)               = 0
> clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0
>
>
>
> On Wednesday, October 29, 2014 at 2:47:23 PM UTC+8, Samdy Sun wrote:
>   Hello, I hit a problem where memcached-1.4.20 gets stuck after EMFILE occurs.
>   Here is my memcached command line:
>   memcached -s /xxx/mc_usock.11201 -c 1024 -m 4000 -f 1.05 -o slab_automove -o slab_reassign -t 1 -p 11201
>  
>   cat /proc/version 
>   Linux version 2.6.32-358.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:47:41 EST 2013
>
>   memcached-1.4.20 gets stuck and stops working after it has been running
> for a while.
>
>   Here is some information from gdb:
>   (gdb) p stats
>   $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0,
>         __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align = 0}, curr_items = 149156,
>   total_items = 9876811, curr_bytes = 3712501870, curr_conns = 5, total_conns = 39738, rejected_conns = 0,
>   malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 0, set_cmds = 0, touch_cmds = 0,
>   get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 0, reclaimed = 0,
>   started = 0, accepting_conns = false, listen_disabled_num = 1, hash_power_level = 17,
>   hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, evicted_unfetched = 0,
>   slab_reassign_running = false, slabs_moved = 20, lru_crawler_running = false,
>   disable_write_by_exptime = 0, disable_write_by_length = 0, disable_write_by_access = 0,
>   evicted_write_reply_timeout_times = 0}
>
>   (gdb) p allow_new_conns
>   $4 = false
>
>   I found that "allow_new_conns" is only set to false when "accept" fails
> and errno is "EMFILE".
>   Here is the code:
> static void drive_machine(conn *c) {
>                  ……
>                  } else if (errno == EMFILE) {
>                    if (settings.verbose > 0)
>                          fprintf(stderr, "Too many open connections\n");
>                    accept_new_conns(false);
>                    stop = true;
>                  } else {
>                  ……
> }
>   
>   If I set the flag "allow_new_conns" back to true, it works again, as shown below:
>   (gdb) set allow_new_conns=1
>   (gdb) p allow_new_conns
>   $5 = true
>   (gdb) c
>   Continuing.
>
>   I know that "allow_new_conns" is set back to "true" when "conn_close" is
> called. But how can that happen when "accept" has failed with errno
> "EMFILE" and that connection was the only one being accepted?
> Note that curr_conns = 5.
>   The process has not run out of file descriptors:
>   ls /proc/1748(memcached_pid)/fd | wc -l
>   17
>   
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups 
> "memcached" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
>

