Re: director stuck in inifite loop on 2.2.35

2018-04-26 Thread Aki Tuomi


On 26.04.2018 11:15, Jan-Pieter Cornet wrote:
> Occasionally, all of our directors get stuck in an endless loop.
> Likely triggered by some network event, but I haven't been able to
> isolate exactly what triggers it, and am unable to reproduce at the
> moment. When this happens, it makes the dovecot/director process on
> all directors unresponsive, making login impossible.
>
> Restarting dovecot "fixes" it.
>
> Here's the info collected from the stuck process. I hope that a
> dovecot dev could use this to either fix a bug in the async IO code or
> point to a possible external source that could have triggered this.
> Would it help to compile dovecot with -g and run a version with
> debugging info, to be able to inspect more when the crash happens again?
>
> "strace" showed this endless loop:
>
> 09:27:52.837960 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
> 09:27:52.837993 epoll_ctl(14, EPOLL_CTL_ADD, 213,
> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000,
> u64=94320799358160}}) = 0
> 09:27:52.838035 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
> 09:27:52.838070 epoll_ctl(14, EPOLL_CTL_ADD, 213,
> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000,
> u64=94320799358160}}) = 0
>
> "ltrace" showed the same loop as:
>
> 10:38:01.773867 timeout_add(0x7530, 0x55c8c351df75, 965,
> 0x55c8c3515a00) = 0x55c8c5c658b0
> 10:38:01.774060 timeout_remove(0x55c8c5acf7a8, 0, 0x, 0) = 0
> 10:38:01.774239 io_add(213, 1, 0x55c8c351df75, 1070) =
> 0x55c8c5c658b0
> 10:38:01.774429 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0,
> 0x7fb4ab11b06a) = 0x55c8c54b8aa0
> 10:38:01.774598 p_strarray_dup(0x7fb4ab703660, 0x55c8c5786e00, 5,
> 0x55c8c3723740) = 0x55c8c56d7530
> 10:38:01.774771 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
> 10:38:01.774957 timeout_add(0x7530, 0x55c8c351df75, 965,
> 0x55c8c3515a00) = 0x55c8c5c658b0
> 10:38:01.775129 timeout_remove(0x55c8c5acf7a8, 0, 0x, 0) = 0
> 10:38:01.775301 io_add(213, 1, 0x55c8c351df75, 1070) =
> 0x55c8c5c658b0
> 10:38:01.775492 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0,
> 0x7fb4ab11b06a) = 0x55c8c54b8aa0
> 10:38:01.775662 p_strarray_dup(0x7fb4ab703660, 0x55c8c56d7530, 5,
> 0x55c8c3723740) = 0x55c8c5786e00
> 10:38:01.775833 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
>
> That fd 213 is the next-highest FD opened, and is bound to
> /var/run/dovecot/director-admin. In fact the last few FDs are all
> connections to director-admin. Last few lines of "lsof" say:
>
> director 16109 dovecot  207u unix 0x940d13c93800  0t0
> 1106341075 type=STREAM
> director 16109 dovecot  208u unix 0x9409f528  0t0
> 1106337217 /var/run/dovecot/director-userdb type=STREAM
> director 16109 dovecot  209u unix 0x940d580fa000  0t0
> 1106337218 type=STREAM
> director 16109 dovecot  210u unix 0x940ac0f61000  0t0
> 1106343439 /var/run/dovecot/director-userdb type=STREAM
> director 16109 dovecot  211u unix 0x940bed105400  0t0
> 1106343440 type=STREAM
> director 16109 dovecot  212u unix 0x940b76d31c00  0t0
> 1106342408 /var/run/dovecot/director-admin type=STREAM
> director 16109 dovecot  213u unix 0x940ab241d000  0t0
> 1106340510 /var/run/dovecot/director-admin type=STREAM
> director 16109 dovecot  214u unix 0x9409c317ec00  0t0
> 1106340511 /var/run/dovecot/director-admin type=STREAM
>
> I attached with "gdb" and did "bt" at two different moments, resulting
> in:
>
> (gdb) bt
> #0  0x7fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> #1  0x7fb4ab48964d in io_loop_handle_add () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #2  0x7fb4ab4876b9 in ?? () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #3  0x7fb4ab48776d in io_add () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #4  0x55c8c35179d3 in doveadm_connections_ring_synced ()
> #5  0x7fb4ab488166 in io_loop_handle_timeouts () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #6  0x7fb4ab489963 in io_loop_handler_run_internal ()
>    from target:/usr/lib/dovecot/libdovecot.so.0
> #7  0x7fb4ab48839c in io_loop_handler_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #8  0x7fb4ab488548 in io_loop_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #9  0x7fb4ab40c0a3 in master_service_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #10 0x55c8c350b46d in main ()
> (gdb) c
> Continuing.
> ^C
> Program received signal SIGINT, Interrupt.
> 0x7fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> (gdb) bt
> #0  0x7fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> #1  0x7fb4ab489873 in io_loop_handle_remove () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #2  0x7fb4ab487adf in ?? () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #3  0x55c8c35158b7 in ?? ()
> #4  0x55c8c3516b4a in ?? ()
> #5  0x55c8c3516d78 in ?? ()
> #6  

director stuck in inifite loop on 2.2.35

2018-04-26 Thread Jan-Pieter Cornet

Occasionally, all of our directors get stuck in an endless loop. Likely 
triggered by some network event, but I haven't been able to isolate exactly 
what triggers it, and am unable to reproduce at the moment. When this happens, 
it makes the dovecot/director process on all directors unresponsive, making 
login impossible.

Restarting dovecot "fixes" it.

Here's the info collected from the stuck process. I hope that a dovecot dev 
could use this to either fix a bug in the async IO code or point to a possible 
external source that could have triggered this. Would it help to compile 
dovecot with -g and run a version with debugging info, to be able to inspect 
more when the crash happens again?

"strace" showed this endless loop:

09:27:52.837960 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
09:27:52.837993 epoll_ctl(14, EPOLL_CTL_ADD, 213, 
{EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000, u64=94320799358160}}) = 0
09:27:52.838035 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
09:27:52.838070 epoll_ctl(14, EPOLL_CTL_ADD, 213, 
{EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000, u64=94320799358160}}) = 0

"ltrace" showed the same loop as:

10:38:01.773867 timeout_add(0x7530, 0x55c8c351df75, 965, 0x55c8c3515a00) = 
0x55c8c5c658b0
10:38:01.774060 timeout_remove(0x55c8c5acf7a8, 0, 0x, 0) = 0
10:38:01.774239 io_add(213, 1, 0x55c8c351df75, 1070) = 0x55c8c5c658b0
10:38:01.774429 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0, 
0x7fb4ab11b06a) = 0x55c8c54b8aa0
10:38:01.774598 p_strarray_dup(0x7fb4ab703660, 0x55c8c5786e00, 5, 
0x55c8c3723740) = 0x55c8c56d7530
10:38:01.774771 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
10:38:01.774957 timeout_add(0x7530, 0x55c8c351df75, 965, 0x55c8c3515a00) = 
0x55c8c5c658b0
10:38:01.775129 timeout_remove(0x55c8c5acf7a8, 0, 0x, 0) = 0
10:38:01.775301 io_add(213, 1, 0x55c8c351df75, 1070) = 0x55c8c5c658b0
10:38:01.775492 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0, 
0x7fb4ab11b06a) = 0x55c8c54b8aa0
10:38:01.775662 p_strarray_dup(0x7fb4ab703660, 0x55c8c56d7530, 5, 
0x55c8c3723740) = 0x55c8c5786e00
10:38:01.775833 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0

That fd 213 is the next-highest FD opened, and is bound to 
/var/run/dovecot/director-admin. In fact the last few FDs are all connections to 
director-admin. Last few lines of "lsof" say:

director 16109 dovecot  207u unix 0x940d13c93800  0t0 1106341075 
type=STREAM
director 16109 dovecot  208u unix 0x9409f528  0t0 1106337217 
/var/run/dovecot/director-userdb type=STREAM
director 16109 dovecot  209u unix 0x940d580fa000  0t0 1106337218 
type=STREAM
director 16109 dovecot  210u unix 0x940ac0f61000  0t0 1106343439 
/var/run/dovecot/director-userdb type=STREAM
director 16109 dovecot  211u unix 0x940bed105400  0t0 1106343440 
type=STREAM
director 16109 dovecot  212u unix 0x940b76d31c00  0t0 1106342408 
/var/run/dovecot/director-admin type=STREAM
director 16109 dovecot  213u unix 0x940ab241d000  0t0 1106340510 
/var/run/dovecot/director-admin type=STREAM
director 16109 dovecot  214u unix 0x9409c317ec00  0t0 1106340511 
/var/run/dovecot/director-admin type=STREAM

I attached with "gdb" and did "bt" at two different moments, resulting in:

(gdb) bt
#0  0x7fb4ab11b06a in epoll_ctl () from 
target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x7fb4ab48964d in io_loop_handle_add () from 
target:/usr/lib/dovecot/libdovecot.so.0
#2  0x7fb4ab4876b9 in ?? () from target:/usr/lib/dovecot/libdovecot.so.0
#3  0x7fb4ab48776d in io_add () from target:/usr/lib/dovecot/libdovecot.so.0
#4  0x55c8c35179d3 in doveadm_connections_ring_synced ()
#5  0x7fb4ab488166 in io_loop_handle_timeouts () from 
target:/usr/lib/dovecot/libdovecot.so.0
#6  0x7fb4ab489963 in io_loop_handler_run_internal ()
   from target:/usr/lib/dovecot/libdovecot.so.0
#7  0x7fb4ab48839c in io_loop_handler_run () from 
target:/usr/lib/dovecot/libdovecot.so.0
#8  0x7fb4ab488548 in io_loop_run () from 
target:/usr/lib/dovecot/libdovecot.so.0
#9  0x7fb4ab40c0a3 in master_service_run () from 
target:/usr/lib/dovecot/libdovecot.so.0
#10 0x55c8c350b46d in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x7fb4ab11b06a in epoll_ctl () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x7fb4ab11b06a in epoll_ctl () from 
target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x7fb4ab489873 in io_loop_handle_remove () from 
target:/usr/lib/dovecot/libdovecot.so.0
#2  0x7fb4ab487adf in ?? () from target:/usr/lib/dovecot/libdovecot.so.0
#3  0x55c8c35158b7 in ?? ()
#4  0x55c8c3516b4a in ?? ()
#5  0x55c8c3516d78 in ?? ()
#6  0x55c8c35179e4 in doveadm_connections_ring_synced ()
#7  0x7fb4ab488166 in io_loop_handle_timeouts () from 
target:/usr/lib/dovecot/libdovecot.so.0
#8  0x7fb4ab489963 in io_loop_handler_run_internal ()
   from