Re: memcached-1.4.20 stuck when "Too many open connections"

Samdy Sun Mon, 03 Nov 2014 05:04:01 -0800

……
Because we also have 32-bit systems and 64-bit applications don't work on 
them. So we build memcached on 32-bit system conformably.


在 2014年11月3日星期一UTC+8下午2时56分01秒，Dormando写道：
>
> I don't really know offhand. why are you running 32bit memcached at all? 
> Just run it in 64bit mode? 
>
> On Sun, 2 Nov 2014, Samdy Sun wrote: 
>
> >   Thanks, Dormando. I will try with smaller ram memory. 
> >    
> >   32bit applications can malloc about 3.8G of ram on my 64bit system, so 
> "-m 3200" may be ok? 
> >    
> > 
> > 在 2014年11月1日星期六UTC+8上午12时30分29秒，Dormando写道： 
> >       Hey, 
> > 
> >       32-bit memcached with -m 4000 will never work. the best you can do 
> is 
> >       probably -m 1600. 32bit applications typically can only allocate 
> up to 2G 
> >       of ram. 
> > 
> >       memcached isn't protected from a lot of malloc failure scenarios, 
> so what 
> >       you're doing will never work. 
> > 
> >       -m 4000 only limits the slab memory usage. there're a lot of 
> buffers/etc 
> >       outside of that. Also the hash table, which is measured 
> separately. 
> > 
> >       On Fri, 31 Oct 2014, Samdy Sun wrote: 
> > 
> >       > @Dormando,   
> >       >   I try my best to reproduce this in my environment, but failed. 
> This just happened on my servers.  
> >       > 
> >       >   I use "stats" command to check the memcached if it 
> is available or not. If the memcached is unavailable, we will not send 
> request 
> >       to it.  
> >       > 
> >       >   This is what I feel strange when my curr_conns is "5" and 
> memcached can't recover itself. I think "conn_new" call maybe fail, and 
> >       it call 
> >       > "close(fd)" directly, not "conn_close()"? Such as below? 
> >       > 
> >       >   1. malloc fails when "conn_new()" 
> >       >   2. event_add fails when "conn_new()" 
> >       >   3. other case? 
> >       > 
> >       >   Take notice of that I build "memcached" on 32-bit system and 
> it runs on 64-bit system. Additionally, I use "-m 4000" for 
> >       memcached's start. 
> >       > 
> >       >   Thanks, 
> >       >   Samdy Sun 
> >       > 
> >       > 在 2014年10月31日星期五UTC+8下午3时01分06秒，Dormando写道： 
> >       >       Hey, 
> >       > 
> >       >       How are you reproducing this? How many connections do you 
> typically have 
> >       >       open? 
> >       > 
> >       >       It's really bizarre that your curr_conns is "5", but your 
> connections are 
> >       >       disabled? Even if there's still a race, as more 
> connections close they 
> >       >       each have an opportunity to flip the acceptor back on. 
> >       > 
> >       >       Can you print what "stats settings" shows? If it's 
> adjusting your actual 
> >       >       maxconns downward it should show there... 
> >       > 
> >       >       On Wed, 29 Oct 2014, Samdy Sun wrote: 
> >       > 
> >       >       > There are no deadlocks, (gdb) info thread 
> >       >       > * 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in 
> transmit (fd=431, which=2, arg=0xfef8ce48) 
> >       >       >     at memcached.c:4044 
> >       >       >   4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in 
> __kernel_vsyscall () 
> >       >       >   3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in 
> __kernel_vsyscall () 
> >       >       >   2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in 
> __kernel_vsyscall () 
> >       >       >   1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in 
> __kernel_vsyscall () 
> >       >       > (gdb) t 1 
> >       >       > [Switching to thread 1 (Thread 0xf77b38d0 (LWP 
> 24961))]#0  0x007ad430 in __kernel_vsyscall () 
> >       >       > (gdb) bt 
> >       >       > #0  0x007ad430 in __kernel_vsyscall () 
> >       >       > #1  0x005c5366 in epoll_wait () from /lib/libc.so.6 
> >       >       > #2  0x0074a750 in epoll_dispatch (base=0x9305008, 
> arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198 
> >       >       > #3  0x0073d714 in event_base_loop (base=0x9305008, 
> flags=0) at event.c:538 
> >       >       > #4  0x08054467 in main (argc=19, argv=0xff8e2274) at 
> memcached.c:5795 
> >       >       > (gdb)  
> >       >       > 
> >       >       > (gdb) t 2 
> >       >       > [Switching to thread 2 (Thread 0xf596eb70 (LWP 
> 24965))]#0  0x007ad430 in __kernel_vsyscall () 
> >       >       > (gdb) bt 
> >       >       > #0  0x007ad430 in __kernel_vsyscall () 
> >       >       > #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib/libpthread.so.0 
> >       >       > #2  0x08055662 in slab_rebalance_thread (arg=0x0) at 
> slabs.c:859 
> >       >       > #3  0x00a61a49 in start_thread () from 
> /lib/libpthread.so.0 
> >       >       > #4  0x005c4aee in clone () from /lib/libc.so.6 
> >       >       > (gdb) t 3 
> >       >       > [Switching to thread 3 (Thread 0xf636fb70 (LWP 
> 24964))]#0  0x007ad430 in __kernel_vsyscall () 
> >       >       > (gdb) bt 
> >       >       > #0  0x007ad430 in __kernel_vsyscall () 
> >       >       > #1  0x005838b6 in nanosleep () from /lib/libc.so.6 
> >       >       > #2  0x005836e0 in sleep () from /lib/libc.so.6 
> >       >       > #3  0x08056f6e in slab_maintenance_thread (arg=0x0) at 
> slabs.c:819 
> >       >       > #4  0x00a61a49 in start_thread () from 
> /lib/libpthread.so.0 
> >       >       > #5  0x005c4aee in clone () from /lib/libc.so.6 
> >       >       > (gdb) t 4 
> >       >       > [Switching to thread 4 (Thread 0xf6d70b70 (LWP 
> 24963))]#0  0x007ad430 in __kernel_vsyscall () 
> >       >       > (gdb) bt 
> >       >       > #0  0x007ad430 in __kernel_vsyscall () 
> >       >       > #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib/libpthread.so.0 
> >       >       > #2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at 
> assoc.c:251 
> >       >       > #3  0x00a61a49 in start_thread () from 
> /lib/libpthread.so.0 
> >       >       > #4  0x005c4aee in clone () from /lib/libc.so.6 
> >       >       > (gdb) t 5 
> >       >       > [Switching to thread 5 (Thread 0xf7771b70 (LWP 
> 24962))]#0  0x007ad430 in __kernel_vsyscall () 
> >       >       > (gdb) bt 
> >       >       > #0  0x007ad430 in __kernel_vsyscall () 
> >       >       > #1  0x00a68998 in sendmsg () from /lib/libpthread.so.0 
> >       >       > #2  0x080509dd in transmit (fd=431, which=2, 
> arg=0xfef8ce48) at memcached.c:4044 
> >       >       > #3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at 
> memcached.c:4370 
> >       >       > #4  event_handler (fd=431, which=2, arg=0xfef8ce48) at 
> memcached.c:4441 
> >       >       > #5  0x0073d9e4 in event_process_active (base=0x9310658, 
> flags=0) at event.c:395 
> >       >       > #6  event_base_loop (base=0x9310658, flags=0) at 
> event.c:547 
> >       >       > #7  0x08059fee in worker_libevent (arg=0x930c698) at 
> thread.c:471 
> >       >       > #8  0x00a61a49 in start_thread () from 
> /lib/libpthread.so.0 
> >       >       > #9  0x005c4aee in clone () from /lib/libc.so.6 
> >       >       > (gdb)  
> >       >       > 
> >       >       > strace info, there is the only event named 
> maxconnsevent on epoll? 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0 
> >       >       > epoll_wait(4, {}, 32, 10)               = 0 
> >       >       > clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0 
> >       >       > 
> >       >       > 
> >       >       > 
> >       >       > 在 2014年10月29日星期三UTC+8下午2时47分23秒，Samdy Sun写道： 
> >       >       >       Hello,  I got a "memcached-1.4.20 stuck" problem 
> when EMFILE happen. 
> >       >       >   Here are my memcached's cmdline "memcached -s 
> /xxx/mc_usock.11201 -c 1024 -m 4000 -f 1.05 -o slab_automove -o 
> slab_reassign 
> >        -t 1 
> >       >       -p 
> >       >       > 11201". 
> >       >       >   
> >       >       >   cat /proc/version  
> >       >       >   Linux version 2.6.32-358.el6.x86_64 (
> [email protected]) (gcc version 4.4.7 20120313 
> (Red Hat 
> >       4.4.7-3) (GCC) 
> >       >       ) #1 
> >       >       > SMP Tue Jan 29 11:47:41 EST 2013 
> >       >       > 
> >       >       >   memcached-1.4.20 stuck and don't work any more when it 
> runs for a period of time. 
> >       >       > 
> >       >       >   Here are some information for gdb:  (gdb) p stats 
> >       >       >   $2 = {mutex = {__data = {__lock = 0, __count = 0, 
> __owner = 0, __kind = 0, __nusers = 0, {__spins = 0,  
> >       >       >         __list = {__next = 0x0}}}, __size = '\000' 
> <repeats 23 times>, __align = 0}, curr_items = 149156,  
> >       >       >   total_items = 9876811, curr_bytes = 3712501870, 
> curr_conns = 5, total_conns = 39738, rejected_conns = 0,  
> >       >       >   malloc_fails = 0, reserved_fds = 5, conn_structs = 
> 1012, get_cmds = 0, set_cmds = 0, touch_cmds = 0,  
> >       >       >   get_hits = 0, get_misses = 0, touch_hits = 0, 
> touch_misses = 0, evictions = 0, reclaimed = 0,  
> >       >       >   started = 0, accepting_conns = false, 
> listen_disabled_num = 1, hash_power_level = 17,  
> >       >       >   hash_bytes = 524288, hash_is_expanding = false, 
> expired_unfetched = 0, evicted_unfetched = 0,  
> >       >       >   slab_reassign_running = false, slabs_moved = 20, 
> lru_crawler_running = false,  
> >       >       >   disable_write_by_exptime = 0, disable_write_by_length 
> = 0, disable_write_by_access = 0,  
> >       >       >   evicted_write_reply_timeout_times = 0} 
> >       >       > 
> >       >       >   (gdb) p allow_new_conns 
> >       >       >   $4 = false 
> >       >       > 
> >       >       >   And I found that "allow_new_conns" just set to false 
> when "accept" failed and errno is "EMFILE".  
> >       >       >   Here are the codes:   
> >       >       > static void drive_machine(conn *c) { 
> >       >       >                  …… 
> >       >       >                  } else if (errno == EMFILE) { 
> >       >       >                    if (settings.verbose > 0) 
> >       >       >                          fprintf(stderr, "Too many open 
> connections\n"); 
> >       >       >                    accept_new_conns(false); 
> >       >       >                    stop = true; 
> >       >       >                  } else { 
> >       >       >                  …… 
> >       >       > } 
> >       >       >    
> >       >       >   If I change the flag "allow_new_conns", it can work 
> again. As below: 
> >       >       >   (gdb) set allow_new_conns=1 
> >       >       >   (gdb) p allow_new_conns 
> >       >       >   $5 = true 
> >       >       >   (gdb) c 
> >       >       >   Continuing. 
> >       >       > 
> >       >       >   I know that "allow_new_conns" will be set to "true" 
> when "conn_close" called. But how could it happen for the case that 
> >       when 
> >       >       "accept" 
> >       >       > failed , and errno is "EMFILE", and this connection is 
> the only one for accepting. Notes that curr_conns = 5. 
> >       >       >   Not run out of fd: 
> >       >       >   ls /proc/1748(memcached_pid)/fd | wc -l 
> >       >       >   17 
> >       >       >    
> >       >       > 
> >       >       > -- 
> >       >       > 
> >       >       > --- 
> >       >       > You received this message because you are subscribed to 
> the Google Groups "memcached" group. 
> >       >       > To unsubscribe from this group and stop receiving emails 
> from it, send an email to [email protected]. 
> >       >       > For more options, visit 
> https://groups.google.com/d/optout. 
> >       >       > 
> >       >       > 
> >       > 
> >       > -- 
> >       > 
> >       > --- 
> >       > You received this message because you are subscribed to the 
> Google Groups "memcached" group. 
> >       > To unsubscribe from this group and stop receiving emails from 
> it, send an email to [email protected]. 
> >       > For more options, visit https://groups.google.com/d/optout. 
> >       > 
> >       > 
> > 
> > -- 
> > 
> > --- 
> > You received this message because you are subscribed to the Google 
> Groups "memcached" group. 
> > To unsubscribe from this group and stop receiving emails from it, send 
> an email to [email protected] <javascript:>. 
> > For more options, visit https://groups.google.com/d/optout. 
> > 
> >

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: memcached-1.4.20 stuck when "Too many open connections"

Reply via email to