In data venerdì 7 settembre 2018 19:11:24 CEST, Timo Sirainen ha scritto: > On 7 Sep 2018, at 19.43, Timo Sirainen <t...@iki.fi> wrote: > > On 7 Sep 2018, at 16.50, Simone Lazzaris <s.lazza...@interactive.eu <mailto:s.lazza...@interactive.eu>> wrote: > >> Some more information: the issue has just occurred, again on an instance > >> without the "service_count = 0" configuration directive on pop3-login. > >> > >> I've observed that while the issue is occurring, the director process > >> goes 100% CPU. I've straced the process. It is seemingly looping: > >> > >> ... > >> ... > >> epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, > >> {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, > >> {0, {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_ADD, > >> 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, > >> u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, > >> {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_ADD, 78, > >> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = > >> 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) > >> = 0> > > Nothing else but these epoll_ctl() calls? So it's gone to some loop where > > it keeps calling io_add() and io_remove(). > I'm guessing it's because of doveadm command handling issues, since there's > some weirdness in the code. Although I couldn't figure out exactly why it > would go to infinite loop there. But attached a patch that may fix it, if > you're able to test. We haven't noticed such infinite looping in other > installations or automated director stresstests though.. > >> FD 13 is "anon_inode:[eventpoll]" > > > > What about fd 78? I guess some socket. > > > > Could you also try two more things when it happens again: > > > > ltrace -tt -e '*' -o ltrace.log -p <pid> > > (My guess this isn't going to be very useful, but just in case it might > > be..) > > > > gdb -p <pid> > > bt full > > quit > > > > Preferably install dovecot-dbg package also so the gdb backtrace output > > will be better. > These would still be useful to verify whether I'm even on the right track.
Timo, unfortunately the patch doesn't compile; I've moved the declaration of "conn" one line up to make it work. In other words, the function becomes like this: void doveadm_connections_ring_synced(void) { struct doveadm_connection *conn = doveadm_ring_sync_pending_connections; while (doveadm_ring_sync_pending_connections != NULL && conn->dir->ring_synced) { doveadm_connection_ring_sync_callback_t *callback = conn->ring_sync_callback; conn->ring_sync_callback = NULL; doveadm_connection_ring_sync_list_move(conn); doveadm_connection_set_io(conn); io_set_pending(conn->io); callback(conn); } if (conn->dir->ring_synced) doveadm_connections_continue_reset_cmds(); } I've tried the patched version but without luck: the director keeps crashing: Sep 8 14:10:56 imapfront2 dovecot: director: Connecting to 212.183.164.163:9090 (as 212.183.164.162): Initial connection Sep 8 14:10:56 imapfront2 dovecot: director: Incoming connection from director 212.183.164.161/in Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/ out): Handshake finished in 0.002 secs (bytes in=61, bytes out=473, 0+0 USERs received, last input 0.000 s ago, last output 0.000 s ago, connected 0.002 s ago, 207 peak output buffer size, 0.000 CPU secs since connected) Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.142 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.143 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.144 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.145 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.216 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.217 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.218 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.219 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Handshake finished in 0.005 secs (bytes in=67951, bytes out=56, 1788+0 USERs received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s ago, 0 peak output buffer size, 0.004 CPU secs since connected) Sep 8 14:10:56 imapfront2 dovecot: pop3-login: Warning: Auth connection closed with 1 pending requests (max 0 secs, pid=22703, EOF) Sep 8 14:10:56 imapfront2 dovecot: auth: Warning: auth client 22703 disconnected with 1 pending requests: EOF Sep 8 14:10:56 imapfront2 dovecot: director: Fatal: master: service(director): child 23037 killed with signal 11 (core dumps disabled) Sep 8 14:10:56 imapfront2 dovecot: pop3-login: Disconnected (auth process communication failure): user=<>, method=DIGEST-MD5, rip=95.253.200.113, lip=80.76.81.122, session=<CLkuBFt1icNf/chx> Sep 8 14:10:56 imapfront2 dovecot: pop3-login: proxy(pa...@cosmocarta.it): started proxying to 192.168.1.142:110: user=<pa...@cosmocarta.it>, method=PLAIN, rip=85.32.56.86, lip=80.76.81.122, session=<KbkuBFt1NXVVIDhW> Sep 8 14:10:56 imapfront2 dovecot: director: Connecting to 212.183.164.163:9090 (as 212.183.164.162): Initial connection Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/ out): Handshake finished in 0.000 secs (bytes in=61, bytes out=383, 0+0 USERs received, last input 0.000 s ago, last output 0.000 s ago, connected 0.000 s ago, 207 peak output buffer size, 0.000 CPU secs since connected) Sep 8 14:10:56 imapfront2 dovecot: director: Incoming connection from director 212.183.164.161/in Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.142 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.143 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.144 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.145 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.216 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.217 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.218 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.219 vhost count changed from 100 to 100 Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Handshake finished in 0.005 secs (bytes in=67921, bytes out=56, 1788+0 USERs received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s ago, 0 peak output buffer size, 0.004 CPU secs since connected) I'm thinking to downgrade to version 2.2.30.2, which I use on another cluster without issues. -- Simone Lazzaris Qcom SpA