Re: memcached-1.4.20 stuck when "Too many open connections"

Samdy Sun Wed, 05 Nov 2014 23:33:59 -0800

Additionally,  the stats counter of malloc_fails is ok.

(gdb) p stats


$1 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, 
__nusers = 0, {__spins = 0, 

        __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align 
= 0}, curr_items = 190673, 

  total_items = 7728815, curr_bytes = 3765464159, curr_conns = 5, total_conns = 
28838, rejected_conns = 0, 

  *malloc_fails = 0*, reserved_fds = 5, *conn_structs = 1012*
, get_cmds = 0, set_cmds = 0, touch_cmds = 0, 

  get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 
0, reclaimed = 0, 

  started = 0, *accepting_conns = false*, *listen_disabled_num = 1*
, hash_power_level = 17, 

  hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, 
evicted_unfetched = 0, 

  slab_reassign_running = false, slabs_moved = 19, lru_crawler_running = false, 

  disable_write_by_exptime = 0, disable_write_by_length = 0, 
disable_write_by_access = 0, 

  evicted_write_reply_timeout_times = 0}

  
在 2014年11月6日星期四UTC+8下午3时21分31秒，Samdy Sun写道：
>
> @Dormando,
>   
>   I think it's not because of malloc fails. 
>   I use "-m 200" and there is 48G ram of my server.
>   After close(c->sfd), I printf a message as "fprintf(stderr, 
> "INFO`fd=%d`reason=%s`allow_new_conns=true\n", c->sfd, "conn_close");"
>   Here are the codes,
> *  static void conn_close(conn *c) {*
> *     ……*
> *     close(c->sfd);*
>  
> *     pthread_mutex_lock(&conn_lock);*
> *     allow_new_conns = true;*
> *     pthread_mutex_unlock(&conn_lock);*
> *     // XXX: tracing fd close*
> *     fprintf(stderr, "INFO`fd=%d`reason=%s`allow_new_conns=true\n", 
> c->sfd, "conn_close");*
> *     ……*
> *  }*
>
>   Here are the logs,
> *  ……*
> *INFO`fd=816`reason=conn_close`allow_new_conns=true*
> *INFO`fd=817`reason=conn_close`allow_new_conns=true*
> *Failed to write, and not due to blocking: Broken pipe*
> *INFO`fd=818`reason=conn_close`allow_new_conns=true*
> *INFO`fd=819`reason=conn_close`allow_new_conns=true*
> *Failed to write, and not due to blocking: Broken pipe*
> *INFO`fd=820`reason=conn_close`allow_new_conns=true*
> *Failed to write, and not due to blocking: Broken pipe*
> *INFO`fd=821`reason=conn_close`allow_new_conns=true*
> *Failed to write, and not due to blocking: Broken pipe*
> *INFO`fd=822`reason=conn_close`allow_new_conns=true*
> *Failed to write, and not due to blocking: Broken pipe*
> *  ……*
> *INFO`fd=1011`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1012`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1013`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1014`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1015`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1016`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1017`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1018`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1019`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1020`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1021`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1022`reason=conn_close`allow_new_conns=true*
> *INFO`fd=1023`reason=conn_close`allow_new_conns=true*
> *accept(): Too many open files*
> *Too many open connections*
> *INFO`fd=-1`errno=24(Too many open files)`reason=Too many open connections*
> *allow_new_conns=false*
>
>
> It's very strange that "accept()" fail after close(fd) has called. The 
> flag called "*allow_new_conns*" is finally set to false, so memcached is 
> stuck.
> But when I found that memcached was stuck, the used fd count are small.
> ls /proc/6873/fd|wc -l
> 17
>
>
>
> 在 2014年11月6日星期四UTC+8上午8时08分50秒，Dormando写道：
>>
>> There're too many things that will go wrong if malloc fails... 
>>
>> There's a stats counter covering some of them. Is that going up for you? 
>>
>> Have you disabled overcommit memory? Have you observed the process size 
>> when it hangs? malloc should almost never actually fail under normal 
>> conditions... 
>>
>> On Wed, 5 Nov 2014, Samdy Sun wrote: 
>>
>> > Hey,   
>> >   I also got a stuck when specifing "-m 200".  
>> >   As mentioned previously, that case could happend as below? 
>> >   1. malloc fails when "conn_new()"  2. event_add fails when 
>> "conn_new()" 
>> >   3. other case? 
>> > 
>> >   And I find another case after code reviewing. Here is, memcached 
>> stuck for a while, for which our client close the connection because 
>> > 200ms-timeout. So, if the previous 1023 connections get timeout and 
>> memcached calls "transmit" to write, "Broken pipe" error will happend. And 
>> > then, memcached get "TRANSMIT_HARD_ERROR" error and calls 
>> "conn_close" immediately. 
>> >   So, it will happend as below? 
>> >   accept(), errno == EMFILE 
>> >   fd1 close, 
>> >   fd2 close, 
>> >   fd3 close, 
>> >   …… 
>> >   fd1023 close, 
>> >   accept_new_conns(false) for EMFILE 
>> > 
>> >   That just is a supposition, but I will try to log some infomation 
>> to prove it. 
>> >    
>> >   Any way, is it better to call "conn_close" after for a while, such as 
>> waiting for next event when getting TRANSMIT_HARD_ERROR error then to 
>> > conn_close immediately? 
>> >    
>> > 
>> > 在 2014年10月31日星期五UTC+8下午3时01分06秒，Dormando写道： 
>> >       Hey, 
>> > 
>> >       How are you reproducing this? How many connections do you 
>> typically have 
>> >       open? 
>> > 
>> >       It's really bizarre that your curr_conns is "5", but your 
>> connections are 
>> >       disabled? Even if there's still a race, as more connections close 
>> they 
>> >       each have an opportunity to flip the acceptor back on. 
>> > 
>> >       Can you print what "stats settings" shows? If it's adjusting your 
>> actual 
>> >       maxconns downward it should show there... 
>> > 
>> >       On Wed, 29 Oct 2014, Samdy Sun wrote: 
>> > 
>> >       > There are no deadlocks, (gdb) info thread 
>> >       > * 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in transmit 
>> (fd=431, which=2, arg=0xfef8ce48) 
>> >       >     at memcached.c:4044 
>> >       >   4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in 
>> __kernel_vsyscall () 
>> >       >   3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in 
>> __kernel_vsyscall () 
>> >       >   2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in 
>> __kernel_vsyscall () 
>> >       >   1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in 
>> __kernel_vsyscall () 
>> >       > (gdb) t 1 
>> >       > [Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0 
>>  0x007ad430 in __kernel_vsyscall () 
>> >       > (gdb) bt 
>> >       > #0  0x007ad430 in __kernel_vsyscall () 
>> >       > #1  0x005c5366 in epoll_wait () from /lib/libc.so.6 
>> >       > #2  0x0074a750 in epoll_dispatch (base=0x9305008, 
>> arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198 
>> >       > #3  0x0073d714 in event_base_loop (base=0x9305008, flags=0) at 
>> event.c:538 
>> >       > #4  0x08054467 in main (argc=19, argv=0xff8e2274) at 
>> memcached.c:5795 
>> >       > (gdb)  
>> >       > 
>> >       > (gdb) t 2 
>> >       > [Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0 
>>  0x007ad430 in __kernel_vsyscall () 
>> >       > (gdb) bt 
>> >       > #0  0x007ad430 in __kernel_vsyscall () 
>> >       > #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
>> /lib/libpthread.so.0 
>> >       > #2  0x08055662 in slab_rebalance_thread (arg=0x0) at 
>> slabs.c:859 
>> >       > #3  0x00a61a49 in start_thread () from /lib/libpthread.so.0 
>> >       > #4  0x005c4aee in clone () from /lib/libc.so.6 
>> >       > (gdb) t 3 
>> >       > [Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0 
>>  0x007ad430 in __kernel_vsyscall () 
>> >       > (gdb) bt 
>> >       > #0  0x007ad430 in __kernel_vsyscall () 
>> >       > #1  0x005838b6 in nanosleep () from /lib/libc.so.6 
>> >       > #2  0x005836e0 in sleep () from /lib/libc.so.6 
>> >       > #3  0x08056f6e in slab_maintenance_thread (arg=0x0) at 
>> slabs.c:819 
>> >       > #4  0x00a61a49 in start_thread () from /lib/libpthread.so.0 
>> >       > #5  0x005c4aee in clone () from /lib/libc.so.6 
>> >       > (gdb) t 4 
>> >       > [Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0 
>>  0x007ad430 in __kernel_vsyscall () 
>> >       > (gdb) bt 
>> >       > #0  0x007ad430 in __kernel_vsyscall () 
>> >       > #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
>> /lib/libpthread.so.0 
>> >       > #2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at 
>> assoc.c:251 
>> >       > #3  0x00a61a49 in start_thread () from /lib/libpthread.so.0 
>> >       > #4  0x005c4aee in clone () from /lib/libc.so.6 
>> >       > (gdb) t 5 
>> >       > [Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0 
>>  0x007ad430 in __kernel_vsyscall () 
>> >       > (gdb) bt 
>> >       > #0  0x007ad430 in __kernel_vsyscall () 
>> >       > #1  0x00a68998 in sendmsg () from /lib/libpthread.so.0 
>> >       > #2  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at 
>> memcached.c:4044 
>> >       > #3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at 
>> memcached.c:4370 
>> >       > #4  event_handler (fd=431, which=2, arg=0xfef8ce48) at 
>> memcached.c:4441 
>> >       > #5  0x0073d9e4 in event_process_active (base=0x9310658, 
>> flags=0) at event.c:395 
>> >       > #6  event_base_loop (base=0x9310658, flags=0) at event.c:547 
>> >       > #7  0x08059fee in worker_libevent (arg=0x930c698) at 
>> thread.c:471 
>> >       > #8  0x00a61a49 in start_thread () from /lib/libpthread.so.0 
>> >       > #9  0x005c4aee in clone () from /lib/libc.so.6 
>> >       > (gdb)  
>> >       > 
>> >       > strace info, there is the only event named maxconnsevent on 
>> epoll? 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0 
>> >       > epoll_wait(4, {}, 32, 10)               = 0 
>> >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0 
>> >       > 
>> >       > 
>> >       > 
>> >       > 在 2014年10月29日星期三UTC+8下午2时47分23秒，Samdy Sun写道： 
>> >       >       Hello,  I got a "memcached-1.4.20 stuck" problem when 
>> EMFILE happen. 
>> >       >   Here are my memcached's cmdline "memcached -s 
>> /xxx/mc_usock.11201 -c 1024 -m 4000 -f 1.05 -o slab_automove -o 
>> slab_reassign  -t 1 
>> >       -p 
>> >       > 11201". 
>> >       >   
>> >       >   cat /proc/version  
>> >       >   Linux version 2.6.32-358.el6.x86_64 (
>> [email protected]) (gcc version 4.4.7 20120313 
>> (Red Hat 4.4.7-3) (GCC) 
>> >       ) #1 
>> >       > SMP Tue Jan 29 11:47:41 EST 2013 
>> >       > 
>> >       >   memcached-1.4.20 stuck and don't work any more when it runs 
>> for a period of time. 
>> >       > 
>> >       >   Here are some information for gdb:  (gdb) p stats 
>> >       >   $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 
>> 0, __kind = 0, __nusers = 0, {__spins = 0,  
>> >       >         __list = {__next = 0x0}}}, __size = '\000' <repeats 23 
>> times>, __align = 0}, curr_items = 149156,  
>> >       >   total_items = 9876811, curr_bytes = 3712501870, curr_conns = 
>> 5, total_conns = 39738, rejected_conns = 0,  
>> >       >   malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, 
>> get_cmds = 0, set_cmds = 0, touch_cmds = 0,  
>> >       >   get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 
>> 0, evictions = 0, reclaimed = 0,  
>> >       >   started = 0, accepting_conns = false, listen_disabled_num = 
>> 1, hash_power_level = 17,  
>> >       >   hash_bytes = 524288, hash_is_expanding = false, 
>> expired_unfetched = 0, evicted_unfetched = 0,  
>> >       >   slab_reassign_running = false, slabs_moved = 20, 
>> lru_crawler_running = false,  
>> >       >   disable_write_by_exptime = 0, disable_write_by_length = 0, 
>> disable_write_by_access = 0,  
>> >       >   evicted_write_reply_timeout_times = 0} 
>> >       > 
>> >       >   (gdb) p allow_new_conns 
>> >       >   $4 = false 
>> >       > 
>> >       >   And I found that "allow_new_conns" just set to false when 
>> "accept" failed and errno is "EMFILE".  
>> >       >   Here are the codes:   
>> >       > static void drive_machine(conn *c) { 
>> >       >                  …… 
>> >       >                  } else if (errno == EMFILE) { 
>> >       >                    if (settings.verbose > 0) 
>> >       >                          fprintf(stderr, "Too many open 
>> connections\n"); 
>> >       >                    accept_new_conns(false); 
>> >       >                    stop = true; 
>> >       >                  } else { 
>> >       >                  …… 
>> >       > } 
>> >       >    
>> >       >   If I change the flag "allow_new_conns", it can work again. As 
>> below: 
>> >       >   (gdb) set allow_new_conns=1 
>> >       >   (gdb) p allow_new_conns 
>> >       >   $5 = true 
>> >       >   (gdb) c 
>> >       >   Continuing. 
>> >       > 
>> >       >   I know that "allow_new_conns" will be set to "true" when 
>> "conn_close" called. But how could it happen for the case that when 
>> >       "accept" 
>> >       > failed , and errno is "EMFILE", and this connection is the only 
>> one for accepting. Notes that curr_conns = 5. 
>> >       >   Not run out of fd: 
>> >       >   ls /proc/1748(memcached_pid)/fd | wc -l 
>> >       >   17 
>> >       >    
>> >       > 
>> >       > -- 
>> >       > 
>> >       > --- 
>> >       > You received this message because you are subscribed to the 
>> Google Groups "memcached" group. 
>> >       > To unsubscribe from this group and stop receiving emails from 
>> it, send an email to [email protected]. 
>> >       > For more options, visit https://groups.google.com/d/optout. 
>> >       > 
>> >       > 
>> > 
>> > -- 
>> > 
>> > --- 
>> > You received this message because you are subscribed to the Google 
>> Groups "memcached" group. 
>> > To unsubscribe from this group and stop receiving emails from it, send 
>> an email to [email protected]. 
>> > For more options, visit https://groups.google.com/d/optout. 
>> > 
>> >
>
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: memcached-1.4.20 stuck when "Too many open connections"

Reply via email to