@Dormando,
I think it's not because of malloc fails.
I use "-m 200" and there is 48G ram of my server.
After close(c->sfd), I printf a message as "fprintf(stderr,
"INFO`fd=%d`reason=%s`allow_new_conns=true\n", c->sfd, "conn_close");"
Here are the codes,
* static void conn_close(conn *c) {*
* ……*
* close(c->sfd);*
* pthread_mutex_lock(&conn_lock);*
* allow_new_conns = true;*
* pthread_mutex_unlock(&conn_lock);*
* // XXX: tracing fd close*
* fprintf(stderr, "INFO`fd=%d`reason=%s`allow_new_conns=true\n",
c->sfd, "conn_close");*
* ……*
* }*
Here are the logs,
* ……*
*INFO`fd=816`reason=conn_close`allow_new_conns=true*
*INFO`fd=817`reason=conn_close`allow_new_conns=true*
*Failed to write, and not due to blocking: Broken pipe*
*INFO`fd=818`reason=conn_close`allow_new_conns=true*
*INFO`fd=819`reason=conn_close`allow_new_conns=true*
*Failed to write, and not due to blocking: Broken pipe*
*INFO`fd=820`reason=conn_close`allow_new_conns=true*
*Failed to write, and not due to blocking: Broken pipe*
*INFO`fd=821`reason=conn_close`allow_new_conns=true*
*Failed to write, and not due to blocking: Broken pipe*
*INFO`fd=822`reason=conn_close`allow_new_conns=true*
*Failed to write, and not due to blocking: Broken pipe*
* ……*
*INFO`fd=1011`reason=conn_close`allow_new_conns=true*
*INFO`fd=1012`reason=conn_close`allow_new_conns=true*
*INFO`fd=1013`reason=conn_close`allow_new_conns=true*
*INFO`fd=1014`reason=conn_close`allow_new_conns=true*
*INFO`fd=1015`reason=conn_close`allow_new_conns=true*
*INFO`fd=1016`reason=conn_close`allow_new_conns=true*
*INFO`fd=1017`reason=conn_close`allow_new_conns=true*
*INFO`fd=1018`reason=conn_close`allow_new_conns=true*
*INFO`fd=1019`reason=conn_close`allow_new_conns=true*
*INFO`fd=1020`reason=conn_close`allow_new_conns=true*
*INFO`fd=1021`reason=conn_close`allow_new_conns=true*
*INFO`fd=1022`reason=conn_close`allow_new_conns=true*
*INFO`fd=1023`reason=conn_close`allow_new_conns=true*
*accept(): Too many open files*
*Too many open connections*
*INFO`fd=-1`errno=24(Too many open files)`reason=Too many open connections*
*allow_new_conns=false*
It's very strange that "accept()" fail after close(fd) has called. The flag
called "*allow_new_conns*" is finally set to false, so memcached is stuck.
But when I found that memcached was stuck, the used fd count are small.
ls /proc/6873/fd|wc -l
17
在 2014年11月6日星期四UTC+8上午8时08分50秒,Dormando写道:
>
> There're too many things that will go wrong if malloc fails...
>
> There's a stats counter covering some of them. Is that going up for you?
>
> Have you disabled overcommit memory? Have you observed the process size
> when it hangs? malloc should almost never actually fail under normal
> conditions...
>
> On Wed, 5 Nov 2014, Samdy Sun wrote:
>
> > Hey,
> > I also got a stuck when specifing "-m 200".
> > As mentioned previously, that case could happend as below?
> > 1. malloc fails when "conn_new()" 2. event_add fails when
> "conn_new()"
> > 3. other case?
> >
> > And I find another case after code reviewing. Here is, memcached stuck
> for a while, for which our client close the connection because
> > 200ms-timeout. So, if the previous 1023 connections get timeout and
> memcached calls "transmit" to write, "Broken pipe" error will happend. And
> > then, memcached get "TRANSMIT_HARD_ERROR" error and calls
> "conn_close" immediately.
> > So, it will happend as below?
> > accept(), errno == EMFILE
> > fd1 close,
> > fd2 close,
> > fd3 close,
> > ……
> > fd1023 close,
> > accept_new_conns(false) for EMFILE
> >
> > That just is a supposition, but I will try to log some infomation
> to prove it.
> >
> > Any way, is it better to call "conn_close" after for a while, such as
> waiting for next event when getting TRANSMIT_HARD_ERROR error then to
> > conn_close immediately?
> >
> >
> > 在 2014年10月31日星期五UTC+8下午3时01分06秒,Dormando写道:
> > Hey,
> >
> > How are you reproducing this? How many connections do you
> typically have
> > open?
> >
> > It's really bizarre that your curr_conns is "5", but your
> connections are
> > disabled? Even if there's still a race, as more connections close
> they
> > each have an opportunity to flip the acceptor back on.
> >
> > Can you print what "stats settings" shows? If it's adjusting your
> actual
> > maxconns downward it should show there...
> >
> > On Wed, 29 Oct 2014, Samdy Sun wrote:
> >
> > > There are no deadlocks, (gdb) info thread
> > > * 5 Thread 0xf7771b70 (LWP 24962) 0x080509dd in transmit
> (fd=431, which=2, arg=0xfef8ce48)
> > > at memcached.c:4044
> > > 4 Thread 0xf6d70b70 (LWP 24963) 0x007ad430 in
> __kernel_vsyscall ()
> > > 3 Thread 0xf636fb70 (LWP 24964) 0x007ad430 in
> __kernel_vsyscall ()
> > > 2 Thread 0xf596eb70 (LWP 24965) 0x007ad430 in
> __kernel_vsyscall ()
> > > 1 Thread 0xf77b38d0 (LWP 24961) 0x007ad430 in
> __kernel_vsyscall ()
> > > (gdb) t 1
> > > [Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0
> 0x007ad430 in __kernel_vsyscall ()
> > > (gdb) bt
> > > #0 0x007ad430 in __kernel_vsyscall ()
> > > #1 0x005c5366 in epoll_wait () from /lib/libc.so.6
> > > #2 0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0,
> tv=0xff8e0cdc) at epoll.c:198
> > > #3 0x0073d714 in event_base_loop (base=0x9305008, flags=0) at
> event.c:538
> > > #4 0x08054467 in main (argc=19, argv=0xff8e2274) at
> memcached.c:5795
> > > (gdb)
> > >
> > > (gdb) t 2
> > > [Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0
> 0x007ad430 in __kernel_vsyscall ()
> > > (gdb) bt
> > > #0 0x007ad430 in __kernel_vsyscall ()
> > > #1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib/libpthread.so.0
> > > #2 0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
> > > #3 0x00a61a49 in start_thread () from /lib/libpthread.so.0
> > > #4 0x005c4aee in clone () from /lib/libc.so.6
> > > (gdb) t 3
> > > [Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0
> 0x007ad430 in __kernel_vsyscall ()
> > > (gdb) bt
> > > #0 0x007ad430 in __kernel_vsyscall ()
> > > #1 0x005838b6 in nanosleep () from /lib/libc.so.6
> > > #2 0x005836e0 in sleep () from /lib/libc.so.6
> > > #3 0x08056f6e in slab_maintenance_thread (arg=0x0) at
> slabs.c:819
> > > #4 0x00a61a49 in start_thread () from /lib/libpthread.so.0
> > > #5 0x005c4aee in clone () from /lib/libc.so.6
> > > (gdb) t 4
> > > [Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0
> 0x007ad430 in __kernel_vsyscall ()
> > > (gdb) bt
> > > #0 0x007ad430 in __kernel_vsyscall ()
> > > #1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib/libpthread.so.0
> > > #2 0x080599f5 in assoc_maintenance_thread (arg=0x0) at
> assoc.c:251
> > > #3 0x00a61a49 in start_thread () from /lib/libpthread.so.0
> > > #4 0x005c4aee in clone () from /lib/libc.so.6
> > > (gdb) t 5
> > > [Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0
> 0x007ad430 in __kernel_vsyscall ()
> > > (gdb) bt
> > > #0 0x007ad430 in __kernel_vsyscall ()
> > > #1 0x00a68998 in sendmsg () from /lib/libpthread.so.0
> > > #2 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at
> memcached.c:4044
> > > #3 drive_machine (fd=431, which=2, arg=0xfef8ce48) at
> memcached.c:4370
> > > #4 event_handler (fd=431, which=2, arg=0xfef8ce48) at
> memcached.c:4441
> > > #5 0x0073d9e4 in event_process_active (base=0x9310658, flags=0)
> at event.c:395
> > > #6 event_base_loop (base=0x9310658, flags=0) at event.c:547
> > > #7 0x08059fee in worker_libevent (arg=0x930c698) at
> thread.c:471
> > > #8 0x00a61a49 in start_thread () from /lib/libpthread.so.0
> > > #9 0x005c4aee in clone () from /lib/libc.so.6
> > > (gdb)
> > >
> > > strace info, there is the only event named maxconnsevent on
> epoll?
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0
> > > epoll_wait(4, {}, 32, 10) = 0
> > > clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0
> > >
> > >
> > >
> > > 在 2014年10月29日星期三UTC+8下午2时47分23秒,Samdy Sun写道:
> > > Hello, I got a "memcached-1.4.20 stuck" problem when
> EMFILE happen.
> > > Here are my memcached's cmdline "memcached -s
> /xxx/mc_usock.11201 -c 1024 -m 4000 -f 1.05 -o slab_automove -o
> slab_reassign -t 1
> > -p
> > > 11201".
> > >
> > > cat /proc/version
> > > Linux version 2.6.32-358.el6.x86_64 (
> [email protected]) (gcc version 4.4.7 20120313
> (Red Hat 4.4.7-3) (GCC)
> > ) #1
> > > SMP Tue Jan 29 11:47:41 EST 2013
> > >
> > > memcached-1.4.20 stuck and don't work any more when it runs
> for a period of time.
> > >
> > > Here are some information for gdb: (gdb) p stats
> > > $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
> __kind = 0, __nusers = 0, {__spins = 0,
> > > __list = {__next = 0x0}}}, __size = '\000' <repeats 23
> times>, __align = 0}, curr_items = 149156,
> > > total_items = 9876811, curr_bytes = 3712501870, curr_conns =
> 5, total_conns = 39738, rejected_conns = 0,
> > > malloc_fails = 0, reserved_fds = 5, conn_structs = 1012,
> get_cmds = 0, set_cmds = 0, touch_cmds = 0,
> > > get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses =
> 0, evictions = 0, reclaimed = 0,
> > > started = 0, accepting_conns = false, listen_disabled_num = 1,
> hash_power_level = 17,
> > > hash_bytes = 524288, hash_is_expanding = false,
> expired_unfetched = 0, evicted_unfetched = 0,
> > > slab_reassign_running = false, slabs_moved = 20,
> lru_crawler_running = false,
> > > disable_write_by_exptime = 0, disable_write_by_length = 0,
> disable_write_by_access = 0,
> > > evicted_write_reply_timeout_times = 0}
> > >
> > > (gdb) p allow_new_conns
> > > $4 = false
> > >
> > > And I found that "allow_new_conns" just set to false when
> "accept" failed and errno is "EMFILE".
> > > Here are the codes:
> > > static void drive_machine(conn *c) {
> > > ……
> > > } else if (errno == EMFILE) {
> > > if (settings.verbose > 0)
> > > fprintf(stderr, "Too many open
> connections\n");
> > > accept_new_conns(false);
> > > stop = true;
> > > } else {
> > > ……
> > > }
> > >
> > > If I change the flag "allow_new_conns", it can work again. As
> below:
> > > (gdb) set allow_new_conns=1
> > > (gdb) p allow_new_conns
> > > $5 = true
> > > (gdb) c
> > > Continuing.
> > >
> > > I know that "allow_new_conns" will be set to "true" when
> "conn_close" called. But how could it happen for the case that when
> > "accept"
> > > failed , and errno is "EMFILE", and this connection is the only
> one for accepting. Notes that curr_conns = 5.
> > > Not run out of fd:
> > > ls /proc/1748(memcached_pid)/fd | wc -l
> > > 17
> > >
> > >
> > > --
> > >
> > > ---
> > > You received this message because you are subscribed to the
> Google Groups "memcached" group.
> > > To unsubscribe from this group and stop receiving emails from
> it, send an email to [email protected].
> > > For more options, visit https://groups.google.com/d/optout.
> > >
> > >
> >
> > --
> >
> > ---
> > You received this message because you are subscribed to the Google
> Groups "memcached" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> an email to [email protected] <javascript:>.
> > For more options, visit https://groups.google.com/d/optout.
> >
> >
--
---
You received this message because you are subscribed to the Google Groups
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.