Re: Auth process sometimes stop responding after upgrade

2018-09-19 Thread Simone Lazzaris
> No, forget about that patch. Looks like I forgot I had already fixed this
> crash, and I guess I was testing with master mainly, which is why I wasn't
> able to reproduce the crash now:
> https://github.com/dovecot/core/commit/c0583917fe760b2d901acf83387cc8edb6f9
> 9550
>  99550>

You nailed it!

I've applied the fix from commit c058 and that solved the issue. I'm no longer 
able to crash 
the director by stopping/starting a backend.



*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook[4]*
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-19 Thread Timo Sirainen
On 19 Sep 2018, at 11.42, Timo Sirainen  wrote:
> 
> On 19 Sep 2018, at 11.30, Timo Sirainen mailto:t...@iki.fi>> 
> wrote:
>> 
>>> 
>>> On 19 Sep 2018, at 11.11, Simone Lazzaris >> > wrote:
>>> 
>>> In data mercoledì 19 settembre 2018 09:30:47 CEST, Timo Sirainen ha scritto:
>>> > On 18 Sep 2018, at 15.15, Simone Lazzaris >> > > wrote:
>>> > > I've got a core dump, and here is the backtrace. Let me know if you want
>>> > > the core file.
>>> > It would be useful if we're able to use it. Could you use
>>> > https://dovecot.org/tools/core-tar.sh 
>>> > 
>>> > >> > > to send the core and the related
>>> > binaries (e.g. just email to me)? The usage is explained at the beginning
>>> > of the script. At least in theory we could then debug with the core file,
>>> > although I've had some trouble even then.
>>> > 
>>> > But just in case the core doesn't work, could you also do:
>>> > 
>>> > bt full
>>> > fr 8
>>> > p *((struct doveadm_connection *)io->context)
>>> > p *((struct doveadm_connection *)io->context)->input
>>>  
>>> I'm sending you the tarball created with core-tar; and just in case:
>> 
>> Thanks, the core worked fine. Does the attached patch (on top of the 
>> previous one) help?
>> 
>> 
> 
> Or here's a slightly different patch, although it should be basically the 
> same fix. This includes the previous patch as well.
> 
> 

No, forget about that patch. Looks like I forgot I had already fixed this 
crash, and I guess I was testing with master mainly, which is why I wasn't able 
to reproduce the crash now: 
https://github.com/dovecot/core/commit/c0583917fe760b2d901acf83387cc8edb6f99550 




Re: Auth process sometimes stop responding after upgrade

2018-09-19 Thread Timo Sirainen
On 19 Sep 2018, at 11.30, Timo Sirainen  wrote:On 19 Sep 2018, at 11.11, Simone Lazzaris  wrote:In data mercoledì 19 settembre 2018 09:30:47 CEST, Timo Sirainen ha scritto:> On 18 Sep 2018, at 15.15, Simone Lazzaris  wrote:> > I've got a core dump, and here is the backtrace. Let me know if you want> > the core file.> It would be useful if we're able to use it. Could you use> https://dovecot.org/tools/core-tar.sh>  to send the core and the related> binaries (e.g. just email to me)? The usage is explained at the beginning> of the script. At least in theory we could then debug with the core file,> although I've had some trouble even then.> > But just in case the core doesn't work, could you also do:> > bt full> fr 8> p *((struct doveadm_connection *)io->context)> p *((struct doveadm_connection *)io->context)->input I'm sending you the tarball created with core-tar; and just in case:Thanks, the core worked fine. Does the attached patch (on top of the previous one) help?Or here's a slightly different patch, although it should be basically the same fix. This includes the previous patch as well.

director2.diff
Description: Binary data


Re: Auth process sometimes stop responding after upgrade

2018-09-19 Thread Timo Sirainen
On 19 Sep 2018, at 11.11, Simone Lazzaris  wrote:In data mercoledì 19 settembre 2018 09:30:47 CEST, Timo Sirainen ha scritto:> On 18 Sep 2018, at 15.15, Simone Lazzaris  wrote:> > I've got a core dump, and here is the backtrace. Let me know if you want> > the core file.> It would be useful if we're able to use it. Could you use> https://dovecot.org/tools/core-tar.sh>  to send the core and the related> binaries (e.g. just email to me)? The usage is explained at the beginning> of the script. At least in theory we could then debug with the core file,> although I've had some trouble even then.> > But just in case the core doesn't work, could you also do:> > bt full> fr 8> p *((struct doveadm_connection *)io->context)> p *((struct doveadm_connection *)io->context)->input I'm sending you the tarball created with core-tar; and just in case:Thanks, the core worked fine. Does the attached patch (on top of the previous one) help?

director.diff
Description: Binary data


Re: Auth process sometimes stop responding after upgrade

2018-09-19 Thread Simone Lazzaris
In data mercoledì 19 settembre 2018 09:30:47 CEST, Timo Sirainen ha scritto:
> On 18 Sep 2018, at 15.15, Simone Lazzaris  wrote:
> > I've got a core dump, and here is the backtrace. Let me know if you want
> > the core file.
> It would be useful if we're able to use it. Could you use
> https://dovecot.org/tools/core-tar.sh
>  to send the core and the related
> binaries (e.g. just email to me)? The usage is explained at the beginning
> of the script. At least in theory we could then debug with the core file,
> although I've had some trouble even then.
> 
> But just in case the core doesn't work, could you also do:
> 
> bt full
> fr 8
> p *((struct doveadm_connection *)io->context)
> p *((struct doveadm_connection *)io->context)->input

I'm sending you the tarball created with core-tar; and just in case:

root@imap-front4:/usr/local/src/dovecot-2.2.36# gdb 
./src/director/.libs/director /var/tmp/
core.10733 
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from 
/usr/local/src/dovecot-2.2.36/src/director/.libs/director...done.
[New LWP 10733]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/i386-linux-gnu/i686/cmov/libthread_db.so.1".
Core was generated by `dovecot/director'.
Program terminated with signal 6, Aborted.
#0  0xb76e4428 in __kernel_vsyscall ()
(gdb) bt full
#0  0xb76e4428 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb74636c1 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
No symbol table info available.
#2  0xb7466af2 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
No symbol table info available.
#3  0xb76485ae in default_fatal_finish (type=, 
status=status@entry=0) at 
failures.c:201
backtrace = 0x82b5168 "/usr/local/lib/dovecot/libdovecot.so.0(+0xa15be) 
[0xb76485be] -> /usr/local/lib/dovecot/libdovecot.so.0(+0xa1641) [0xb7648641] 
-> /usr/
local/lib/dovecot/libdovecot.so.0(i_fatal+0) [0xb75ce35e] -> dove"...
#4  0xb7648641 in i_internal_fatal_handler (ctx=0xbf839cc0, format=0x805c274 
"file %s: 
line %d (%s): assertion failed: (%s)", args=0xbf839ce4 "4\370\005\bI\004") at 
failures.c:670
status = 0
#5  0xb75ce35e in i_panic (format=format@entry=0x805c274 "file %s: line %d 
(%s): 
assertion failed: (%s)") at failures.c:275
ctx = {type = LOG_TYPE_PANIC, exit_status = 0, timestamp = 0x0, 
timestamp_usecs = 0}
args = 0xbf839ce4 "4\370\005\bI\004"
#6  0x080574f7 in doveadm_connection_deinit (_conn=_conn@entry=0xbf839d60) at 
doveadm-connection.c:1097
conn = 0x82fb580
__FUNCTION__ = "doveadm_connection_deinit"
#7  0x08057f03 in doveadm_connection_input (conn=0x0) at 
doveadm-connection.c:1051
line = 
ret = 
#8  0xb76613db in io_loop_call_io (io=0x82fb780) at ioloop.c:600
ioloop = 0x82bd648
t_id = 2
__FUNCTION__ = "io_loop_call_io"
#9  0xb7662e1e in io_loop_handler_run_internal (ioloop=ioloop@entry=0x82bd648) 
at 
ioloop-epoll.c:223
ctx = 0x82c9a40
events = 0x0
event = 0x82c9a80
list = 0x82e1830
io = 
tv = {tv_sec = 0, tv_usec = 236182}
events_count = 0
msecs = 
ret = 1
i = 
j = 
call = 
__FUNCTION__ = "io_loop_handler_run_internal"
#10 0xb7661496 in io_loop_handler_run (ioloop=ioloop@entry=0x82bd648) at 
ioloop.c:649
No locals.
#11 0xb7661658 in io_loop_run (ioloop=0x82bd648) at ioloop.c:624
__FUNCTION__ = "io_loop_run"
#12 0xb75da45e in master_service_run (service=0x82bd578, 
callback=callback@entry=0x804d360 ) at master-service.c:719
No locals.
#13 0x0804cf5e in main (argc=1, argv=0x82bd300) at main.c:366
set_roots = {0x805f6c0, 0x0}
test_port = 0
error = 
debug = false
c = 
(gdb) fr 8
#8  0xb76613db in io_loop_call_io (io=0x82fb780) at ioloop.c:600
600 io->callback(io->context);
(gdb) p *((struct doveadm_connection *)io->context)
$1 = {prev = 0x0, next = 0x0, fd = 21, io = 0x0, input = 0x82fc478, output = 
0x82fc598, dir = 
0x82c20a8, to_ring_sync_abort = 0x82e8de0, reset_cmd = 0x0, kick_cmd = 0x0, 
  ring_sync_callback = 0x80570d0 , cmd_pending_args 
= 
0x0, cmd_pending_idx = 0, handshaked = 1}
(gdb) p *((struct doveadm_connection *)io->context)->input
$2 = {v_offset = 51, stream_errno = 0, mmaped = 0, blocking = 0, closed = 0, 
readable_fd = 
1, seekable = 0, eof = 1, real_stream = 0x82fc440}
(gdb) quit



Re: Auth process sometimes stop responding after upgrade

2018-09-19 Thread Timo Sirainen
On 18 Sep 2018, at 15.15, Simone Lazzaris  wrote:
> 
> I've got a core dump, and here is the backtrace. Let me know if you want the 
> core file.

It would be useful if we're able to use it. Could you use 
https://dovecot.org/tools/core-tar.sh  
to send the core and the related binaries (e.g. just email to me)? The usage is 
explained at the beginning of the script. At least in theory we could then 
debug with the core file, although I've had some trouble even then.

But just in case the core doesn't work, could you also do:

bt full
fr 8
p *((struct doveadm_connection *)io->context)
p *((struct doveadm_connection *)io->context)->input



Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Simone Lazzaris
In data martedì 18 settembre 2018 14:25:25 CEST, Aki Tuomi ha scritto:
> Can you provide 'bt full'
> 

Sure:

(gdb) bt full
#0  0xb76e4428 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb74636c1 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
No symbol table info available.
#2  0xb7466af2 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
No symbol table info available.
#3  0xb76485ae in default_fatal_finish (type=, 
status=status@entry=0) at 
failures.c:201
backtrace = 0x82b5168 "/usr/local/lib/dovecot/libdovecot.so.0(+0xa15be) 
[0xb76485be] -> /usr/local/lib/dovecot/libdovecot.so.0(+0xa1641) [0xb7648641] 
-> /usr/
local/lib/dovecot/libdovecot.so.0(i_fatal+0) [0xb75ce35e] -> dove"...
#4  0xb7648641 in i_internal_fatal_handler (ctx=0xbf839cc0, format=0x805c274 
"file %s: 
line %d (%s): assertion failed: (%s)", args=0xbf839ce4 "4\370\005\bI\004") at 
failures.c:670
status = 0
#5  0xb75ce35e in i_panic (format=format@entry=0x805c274 "file %s: line %d 
(%s): 
assertion failed: (%s)") at failures.c:275
ctx = {type = LOG_TYPE_PANIC, exit_status = 0, timestamp = 0x0, 
timestamp_usecs = 0}
args = 0xbf839ce4 "4\370\005\bI\004"
#6  0x080574f7 in doveadm_connection_deinit (_conn=_conn@entry=0xbf839d60) at 
doveadm-connection.c:1097
conn = 0x82fb580
__FUNCTION__ = "doveadm_connection_deinit"
#7  0x08057f03 in doveadm_connection_input (conn=0x0) at 
doveadm-connection.c:1051
line = 
ret = 
#8  0xb76613db in io_loop_call_io (io=0x82fb780) at ioloop.c:600
ioloop = 0x82bd648
t_id = 2
__FUNCTION__ = "io_loop_call_io"
#9  0xb7662e1e in io_loop_handler_run_internal (ioloop=ioloop@entry=0x82bd648) 
at 
ioloop-epoll.c:223
ctx = 0x82c9a40
events = 0x0
event = 0x82c9a80
list = 0x82e1830
io = 
tv = {tv_sec = 0, tv_usec = 236182}
events_count = 0
msecs = 
ret = 1
i = 
j = 
call = 
__FUNCTION__ = "io_loop_handler_run_internal"
#10 0xb7661496 in io_loop_handler_run (ioloop=ioloop@entry=0x82bd648) at 
ioloop.c:649
No locals.
#11 0xb7661658 in io_loop_run (ioloop=0x82bd648) at ioloop.c:624
__FUNCTION__ = "io_loop_run"
#12 0xb75da45e in master_service_run (service=0x82bd578, 
callback=callback@entry=0x804d360 ) at master-service.c:719
No locals.
#13 0x0804cf5e in main (argc=1, argv=0x82bd300) at main.c:366
set_roots = {0x805f6c0, 0x0}
test_port = 0
error = 
debug = false
c = 
(gdb) 


I have realized that on these machine used to run dovecot 2.1.x and were 
upgraded with 
"make install" on 2.2.36. I've found a library that is not upgraded (maybe is 
missing or not 
compiled on the new version).


-rw-r--r-- 1 root staff 1963428 Jun 17  2016 
/usr/local/lib/dovecot/libdovecot-ssl.a
-rwxr-xr-x 1 root staff1014 Jun 17  2016 
/usr/local/lib/dovecot/libdovecot-ssl.la
lrwxrwxrwx 1 root staff  23 Jun 17  2016 
/usr/local/lib/dovecot/libdovecot-ssl.so -> 
libdovecot-ssl.so.0.0.0
lrwxrwxrwx 1 root staff  23 Jun 17  2016 
/usr/local/lib/dovecot/libdovecot-ssl.so.0 -> 
libdovecot-ssl.so.0.0.0
-rwxr-xr-x 1 root staff 1284527 Jun 17  2016 
/usr/local/lib/dovecot/libdovecot-ssl.so.0.0.0

Can that be the ultimate cause?


*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook[4]*
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Aki Tuomi
Can you provide 'bt full'

Aki


On 18.09.2018 15:15, Simone Lazzaris wrote:
>
> In data martedì 18 settembre 2018 14:07:26 CEST, Aki Tuomi ha scritto:
>
> > If you are using systemd, create
>
> > /etc/systemd/system/dovecot.service.d/limits.conf and put
>
> > [Service]LimitCORE=infinity
>
> >
>
> > and run 
>
> > systemctl daemon-reloadsystemctl restart dovecot
>
>  
>
> Nope, I'm on a debian 7, without systemd. Anyway, I've resolved the
> issue: I had to set fs.suid_dumpable BEFORE starting dovecot.
>
>  
>
> I've got a core dump, and here is the backtrace. Let me know if you
> want the core file.
>
>  
>
> gdb) bt
>
> #0 0xb76e4428 in __kernel_vsyscall ()
>
> #1 0xb74636c1 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
>
> #2 0xb7466af2 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
>
> #3 0xb76485ae in default_fatal_finish (type=,
> status=status@entry=0) at failures.c:201
>
> #4 0xb7648641 in i_internal_fatal_handler (ctx=0xbf839cc0,
> format=0x805c274 "file %s: line %d (%s): assertion failed: (%s)",
> args=0xbf839ce4 "4\370\005\bI\004") at failures.c:670
>
> #5 0xb75ce35e in i_panic (format=format@entry=0x805c274 "file %s: line
> %d (%s): assertion failed: (%s)") at failures.c:275
>
> #6 0x080574f7 in doveadm_connection_deinit
> (_conn=_conn@entry=0xbf839d60) at doveadm-connection.c:1097
>
> #7 0x08057f03 in doveadm_connection_input (conn=0x0) at
> doveadm-connection.c:1051
>
> #8 0xb76613db in io_loop_call_io (io=0x82fb780) at ioloop.c:600
>
> #9 0xb7662e1e in io_loop_handler_run_internal
> (ioloop=ioloop@entry=0x82bd648) at ioloop-epoll.c:223
>
> #10 0xb7661496 in io_loop_handler_run (ioloop=ioloop@entry=0x82bd648)
> at ioloop.c:649
>
> #11 0xb7661658 in io_loop_run (ioloop=0x82bd648) at ioloop.c:624
>
> #12 0xb75da45e in master_service_run (service=0x82bd578,
> callback=callback@entry=0x804d360 ) at
> master-service.c:719
>
> #13 0x0804cf5e in main (argc=1, argv=0x82bd300) at main.c:366
>
>  
>
> -- 
>
> Simone LazzarisResponsabile datacenter Qcom S.p.A.Via Roggia Vignola,
> 9 | 24047 Treviglio (BG) T +39036347905 | D +3903631970352| M
> +393938111237 simone.lazza...@qcom.it
> | www.qcom.it
> Qcom Official PagesLinkedIn
> | Facebook
> 
>



Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Simone Lazzaris
In data martedì 18 settembre 2018 14:07:26 CEST, Aki Tuomi ha scritto:
> If you are using systemd, create
> /etc/systemd/system/dovecot.service.d/limits.conf and put
> [Service]LimitCORE=infinity
> 
> and run 
> systemctl daemon-reloadsystemctl restart dovecot

Nope, I'm on a debian 7, without systemd. Anyway, I've resolved the issue: I 
had to set 
fs.suid_dumpable BEFORE starting dovecot.

I've got a core dump, and here is the backtrace. Let me know if you want the 
core file.

gdb) bt
#0  0xb76e4428 in __kernel_vsyscall ()
#1  0xb74636c1 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#2  0xb7466af2 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#3  0xb76485ae in default_fatal_finish (type=, 
status=status@entry=0) at 
failures.c:201
#4  0xb7648641 in i_internal_fatal_handler (ctx=0xbf839cc0, format=0x805c274 
"file %s: 
line %d (%s): assertion failed: (%s)", args=0xbf839ce4 "4\370\005\bI\004") at 
failures.c:670
#5  0xb75ce35e in i_panic (format=format@entry=0x805c274 "file %s: line %d 
(%s): 
assertion failed: (%s)") at failures.c:275
#6  0x080574f7 in doveadm_connection_deinit (_conn=_conn@entry=0xbf839d60) at 
doveadm-connection.c:1097
#7  0x08057f03 in doveadm_connection_input (conn=0x0) at 
doveadm-connection.c:1051
#8  0xb76613db in io_loop_call_io (io=0x82fb780) at ioloop.c:600
#9  0xb7662e1e in io_loop_handler_run_internal (ioloop=ioloop@entry=0x82bd648) 
at 
ioloop-epoll.c:223
#10 0xb7661496 in io_loop_handler_run (ioloop=ioloop@entry=0x82bd648) at 
ioloop.c:649
#11 0xb7661658 in io_loop_run (ioloop=0x82bd648) at ioloop.c:624
#12 0xb75da45e in master_service_run (service=0x82bd578, 
callback=callback@entry=0x804d360 ) at master-service.c:719
#13 0x0804cf5e in main (argc=1, argv=0x82bd300) at main.c:366



*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook*[4]
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Aki Tuomi
If you are using systemd, create 
/etc/systemd/system/dovecot.service.d/limits.conf and put
[Service]LimitCORE=infinity

and run 
systemctl daemon-reloadsystemctl restart dovecot
---Aki TuomiDovecot oy
 Original message From: Simone Lazzaris 
 Date: 18/09/2018  15:01  (GMT+02:00) To: Timo 
Sirainen  Cc: dovecot@dovecot.org Subject: Re: Auth process 
sometimes stop responding after upgrade 

Alas, I've set fs.suid_dumpable to 2 but the core is not dumped. So far I've 
checked:
 
- ulimit -c unlimited, done
- /proc/sys/kernel/core_pattern is set to /var/tmp/core.%p
- /var/tmp is chmod 1777
- daemon is not chrooted 
- sysctl -w fs.suid_dumpable=2
- dir /var/tmp is empty and filesystem has 2GB free
 
 
This is the logfile:
 
Sep 18 13:54:22 imap-front4 dovecot: director: doveadm: Host 192.168.1.145 
changed down (vhost_count=100 last_updown_change=0)
Sep 18 13:54:52 imap-front4 dovecot: director: doveadm: Host 192.168.1.145 
changed up (vhost_count=100 last_updown_change=1537271662)
Sep 18 13:54:52 imap-front4 dovecot: director: Panic: file 
doveadm-connection.c: line 1097 (doveadm_connection_deinit): assertion failed: 
(conn->to_ring_sync_abort == NULL)
Sep 18 13:54:52 imap-front4 dovecot: director: Error: Raw backtrace: 
/usr/local/lib/dovecot/libdovecot.so.0(+0xa15be) [0xb76fa5be] -> 
/usr/local/lib/dovecot/libdovecot.so.0(+0xa1641) [0xb76fa641] -> 
/usr/local/lib/dovecot/libdovecot.so.0(i_fatal+0) [0xb768035e] -> 
dovecot/director() [0x80574f7] -> dovecot/director() [0x8057f03] -> 
/usr/local/lib/dovecot/libdovecot.so.0(io_loop_call_io+0x6b) [0xb77133db] -> 
/usr/local/lib/dovecot/libdovecot.so.0(io_loop_handler_run_internal+0xfe) 
[0xb7714e1e] -> 
/usr/local/lib/dovecot/libdovecot.so.0(io_loop_handler_run+0x46) [0xb7713496] 
-> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_run+0x48) [0xb7713658] -> 
/usr/local/lib/dovecot/libdovecot.so.0(master_service_run+0x2e) [0xb768c45e] -> 
dovecot/director(main+0x49e) [0x804cf5e] -> 
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xb7501e46] -> 
dovecot/director() [0x804d081]
Sep 18 13:54:52 imap-front4 dovecot: director: Fatal: master: 
service(director): child 8059 killed with signal 6 (core not dumped)
Sep 18 13:54:52 imap-front4 dovecot: director: Connecting to 
212.183.164.161:9090 (as 212.183.164.164): Initial connection
Sep 18 13:54:52 imap-front4 dovecot: director: Incoming connection from 
director 212.183.164.163/in
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 192.168.1.142 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 192.168.1.143 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 192.168.1.144 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 192.168.1.145 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 192.168.1.216 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 192.168.1.217 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 192.168.1.218 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 192.168.1.219 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: 
director(212.183.164.161:9090/out): Handshake finished in 0.001 secs (bytes 
in=61, bytes out=791, 0+0 USERs received, last input 0.000 s ago, last output 
0.001 s ago, connected 0.001 s ago, 408 peak output buffer size, 0.000 CPU secs 
since connected)
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Handshake finished in 0.006 secs (bytes in=111411, bytes out=56, 2940+0 USERs 
received, last input 0.000 s ago, last output 0.006 s ago, connected 0.006 s 
ago, 0 peak output buffer size, 0.004 CPU secs since connected)
 
 
I can confirm that I can trigger the issue having one of the backends flapping 
down/up.

-- 
Simone Lazzaris
Responsabile datacenter

Qcom S.p.A.
Via Roggia Vignola, 9 | 24047 Treviglio (BG)
T +39036347905 | D +3903631970352| M +393938111237
simone.lazza...@qcom.it | www.qcom.it

Qcom Official Pages LinkedIn | Facebook

 





Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Simone Lazzaris
Alas, I've set fs.suid_dumpable to 2 but the core is not dumped. So far I've 
checked:

- ulimit -c unlimited, done
- /proc/sys/kernel/core_pattern is set to /var/tmp/core.%p
- /var/tmp is chmod 1777
- daemon is not chrooted 
- sysctl -w fs.suid_dumpable=2
- dir /var/tmp is empty and filesystem has 2GB free


This is the logfile:

Sep 18 13:54:22 imap-front4 dovecot: director: doveadm: Host 192.168.1.145 
changed 
down (vhost_count=100 last_updown_change=0)
Sep 18 13:54:52 imap-front4 dovecot: director: doveadm: Host 192.168.1.145 
changed up 
(vhost_count=100 last_updown_change=1537271662)
Sep 18 13:54:52 imap-front4 dovecot: director: Panic: file 
doveadm-connection.c: line 1097 
(doveadm_connection_deinit): assertion failed: (conn->to_ring_sync_abort == 
NULL)
Sep 18 13:54:52 imap-front4 dovecot: director: Error: Raw backtrace: 
/usr/local/lib/dovecot/
libdovecot.so.0(+0xa15be) [0xb76fa5be] -> /usr/local/lib/dovecot/libdovecot.so.
0(+0xa1641) [0xb76fa641] -> /usr/local/lib/dovecot/libdovecot.so.0(i_fatal+0) 
[0xb768035e] 
-> dovecot/director() [0x80574f7] -> dovecot/director() [0x8057f03] -> 
/usr/local/lib/
dovecot/libdovecot.so.0(io_loop_call_io+0x6b) [0xb77133db] -> 
/usr/local/lib/dovecot/
libdovecot.so.0(io_loop_handler_run_internal+0xfe) [0xb7714e1e] -> 
/usr/local/lib/dovecot/
libdovecot.so.0(io_loop_handler_run+0x46) [0xb7713496] -> 
/usr/local/lib/dovecot/
libdovecot.so.0(io_loop_run+0x48) [0xb7713658] -> 
/usr/local/lib/dovecot/libdovecot.so.
0(master_service_run+0x2e) [0xb768c45e] -> dovecot/director(main+0x49e) 
[0x804cf5e] -> 
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xb7501e46] -> 
dovecot/
director() [0x804d081]
Sep 18 13:54:52 imap-front4 dovecot: director: Fatal: master: 
service(director): child 8059 
killed with signal 6 (core not dumped)
Sep 18 13:54:52 imap-front4 dovecot: director: Connecting to 
212.183.164.161:9090 (as 
212.183.164.164): Initial connection
Sep 18 13:54:52 imap-front4 dovecot: director: Incoming connection from 
director 
212.183.164.163/in
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 
192.168.1.142 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 
192.168.1.143 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 
192.168.1.144 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 
192.168.1.145 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 
192.168.1.216 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 
192.168.1.217 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 
192.168.1.218 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Host 
192.168.1.219 vhost count changed from 100 to 100
Sep 18 13:54:52 imap-front4 dovecot: director: 
director(212.183.164.161:9090/out): 
Handshake finished in 0.001 secs (bytes in=61, bytes out=791, 0+0 USERs 
received, last 
input 0.000 s ago, last output 0.001 s ago, connected 0.001 s ago, 408 peak 
output buffer 
size, 0.000 CPU secs since connected)
Sep 18 13:54:52 imap-front4 dovecot: director: director(212.183.164.163/in): 
Handshake 
finished in 0.006 secs (bytes in=111411, bytes out=56, 2940+0 USERs received, 
last input 
0.000 s ago, last output 0.006 s ago, connected 0.006 s ago, 0 peak output 
buffer size, 
0.004 CPU secs since connected)


I can confirm that I can trigger the issue having one of the backends flapping 
down/up.


*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook*[4]
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Timo Sirainen
On 18 Sep 2018, at 13.29, Simone Lazzaris  wrote:
> 
> > Hi all, again;
> >
> > I've enabled the core dumps and let it go for some day waiting for the issue
> > to reoccur.
> >
> > Meantime I've also upgraded the poolmon script, as Sami suggested.
> >
> > It seems that the upgrade has scared the issue away, because it no longer
> > occurred.
> >
> > Maybe the problem is related to the way the old poolmon talked to the
> > director daemon? I'm not very inclined to downgrade poolmon to catch a
> > traceback, but can do if neccessary.
>  
> Well, maybe it's not necessary ;)
> I've performed some maintenance operations on the backends and that triggered 
> the crash. It seems that something goes wrong where one backend come back 
> online.

It's weird how easily you can reproduce the crash. I've ran all kinds of 
(stress) tests and I can't reproduce this crash. I was able to reproduce the 
original hang though.
 
> Unfortunately, the core was not dumped And I don't know what to do: the 
> director service was not chrooted, and ulimit -c is unlimited.

Do you have: sysctl -w fs.suid_dumpable=2



Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Simone Lazzaris
> Hi all, again;
> 
> I've enabled the core dumps and let it go for some day waiting for the issue
> to reoccur.
> 
> Meantime I've also upgraded the poolmon script, as Sami suggested.
> 
> It seems that the upgrade has scared the issue away, because it no longer
> occurred.
> 
> Maybe the problem is related to the way the old poolmon talked to the
> director daemon? I'm not very inclined to downgrade poolmon to catch a
> traceback, but can do if neccessary.

Well, maybe it's not necessary ;)
I've performed some maintenance operations on the backends and that triggered 
the 
crash. It seems that something goes wrong where one backend come back online.

Unfortunately, the core was not dumped And I don't know what to do: the 
director 
service was not chrooted, and ulimit -c is unlimited.

>From the log file:

Sep 18 12:21:46 imap-front4 dovecot: director: Panic: file 
doveadm-connection.c: line 1097 
(doveadm_connection_deinit): assertion failed: (conn->to_ring_sync_abort == 
NULL)
Sep 18 12:21:46 imap-front4 dovecot: director: Error: Raw backtrace: 
/usr/local/lib/dovecot/
libdovecot.so.0(+0xa15be) [0xb77345be] -> /usr/local/lib/dovecot/libdovecot.so.
0(+0xa1641) [0xb7734641] -> /usr/local/lib/dovecot/libdovecot.so.0(i_fatal+0) 
[0xb76ba35e] 
-> dovecot/director() [0x80574f7] -> dovecot/director() [0x8057f03] -> 
/usr/local/lib/
dovecot/libdovecot.so.0(io_loop_call_io+0x6b) [0xb774d3db] -> 
/usr/local/lib/dovecot/
libdovecot.so.0(io_loop_handler_run_internal+0xfe) [0xb774ee1e] -> 
/usr/local/lib/dovecot/
libdovecot.so.0(io_loop_handler_run+0x46) [0xb774d496] -> 
/usr/local/lib/dovecot/
libdovecot.so.0(io_loop_run+0x48) [0xb774d658] -> 
/usr/local/lib/dovecot/libdovecot.so.
0(master_service_run+0x2e) [0xb76c645e] -> dovecot/director(main+0x49e) 
[0x804cf5e] -> 
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xb753be46] -> 
dovecot/
director() [0x804d081]
Sep 18 12:21:46 imap-front4 dovecot: director: Fatal: master: 
service(director): child 7941 
killed with signal 6 (core not dumped)
Sep 18 12:21:46 imap-front4 dovecot: director: Connecting to 
212.183.164.161:9090 (as 
212.183.164.164): Initial connection
Sep 18 12:21:46 imap-front4 dovecot: director: Incoming connection from 
director 
212.183.164.163/in

My current config:

root@imap-front4:~# doveconf -n
# 2.2.36 (1f10bfa63): /usr/local/etc/dovecot/dovecot.conf
# OS: Linux 3.2.0-4-686-pae i686 Debian 7.11 
# Hostname: imap-front4
auth_mechanisms = plain login digest-md5 cram-md5 apop scram-sha-1
auth_verbose = yes
auth_verbose_passwords = plain
base_dir = /var/run/dovecot/
default_login_user = nobody
director_doveadm_port = 9091
director_mail_servers = 192.168.1.142 192.168.1.143 192.168.1.216 192.168.1.217 
192.168.1.218 192.168.1.219
director_servers = 212.183.164.161 212.183.164.162 212.183.164.163 
212.183.164.164
disable_plaintext_auth = no
listen = *
passdb {
  args = /usr/local/etc/dovecot/sql.conf
  driver = sql
}
protocols = imap pop3
service director {
  chroot = 
  fifo_listener login/proxy-notify {
mode = 0666
  }
  inet_listener {
port = 9090
  }
  unix_listener director-userdb {
mode = 0600
  }
  unix_listener login/director {
mode = 0666
  }
}
service imap-login {
  executable = imap-login director
  service_count = 0
  vsz_limit = 128 M
}
service pop3-login {
  executable = pop3-login director
  service_count = 0
  vsz_limit = 128 M
}
ssl_cert = 

Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Simone Lazzaris
In data martedì 11 settembre 2018 10:46:30 CEST, Timo Sirainen ha scritto:
> On 11 Sep 2018, at 10.57, Simone Lazzaris  wrote:
> > Sep 11 03:25:55 imap-front4 dovecot: director: Panic: file
> > doveadm-connection.c: line 1097 (doveadm_connection_deinit): assertion
> > failed: (conn->to_ring_sync_abort == NULL) Sep 11 03:25:55 imap-front4
> > dovecot: director: Fatal: master: service(director): child 4395 killed
> > with signal 6 (core dumps disabled)
> It's crashing. Can you get gdb backtrace? First enable core dumps.
> https://dovecot.org/bugreport.html#coredumps
> 

Hi all, again;

I've enabled the core dumps and let it go for some day waiting for the issue to 
reoccur.

Meantime I've also upgraded the poolmon script, as Sami suggested.

It seems that the upgrade has scared the issue away, because it no longer 
occurred.

Maybe the problem is related to the way the old poolmon talked to the director 
daemon? 
I'm not very inclined to downgrade poolmon to catch a traceback, but can do if 
neccessary.


-- 
*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook*[4]
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-18 Thread Simone Lazzaris
In data martedì 11 settembre 2018 10:46:30 CEST, Timo Sirainen ha scritto:
> On 11 Sep 2018, at 10.57, Simone Lazzaris  wrote:
> > Sep 11 03:25:55 imap-front4 dovecot: director: Panic: file
> > doveadm-connection.c: line 1097 (doveadm_connection_deinit): assertion
> > failed: (conn->to_ring_sync_abort == NULL) Sep 11 03:25:55 imap-front4
> > dovecot: director: Fatal: master: service(director): child 4395 killed
> > with signal 6 (core dumps disabled)
> It's crashing. Can you get gdb backtrace? First enable core dumps.
> https://dovecot.org/bugreport.html#coredumps
> 

Hi all, again;

I've enabled the core dumps and let it go for some day waiting for the issue to 
reoccur.

Meantime I've also upgraded the poolmon script, as Sami suggested.

It seems that the upgrade has scared the issue away, because it no longer 
occurred.

Maybe the problem is related to the way the old poolmon talked to the director 
daemon? 
I'm not very inclined to downgrade poolmon to catch a traceback, but can do if 
neccessary.


-- 
*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook*[4]
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-11 Thread Timo Sirainen
On 11 Sep 2018, at 10.57, Simone Lazzaris  wrote:
> 
> Sep 11 03:25:55 imap-front4 dovecot: director: Panic: file 
> doveadm-connection.c: line 1097 (doveadm_connection_deinit): assertion 
> failed: (conn->to_ring_sync_abort == NULL)
> Sep 11 03:25:55 imap-front4 dovecot: director: Fatal: master: 
> service(director): child 4395 killed with signal 6 (core dumps disabled)

It's crashing. Can you get gdb backtrace? First enable core dumps. 
https://dovecot.org/bugreport.html#coredumps 




Re: Auth process sometimes stop responding after upgrade

2018-09-11 Thread Simone Lazzaris
In data lunedì 10 settembre 2018 09:58:50 CEST, Timo Sirainen ha scritto:
> On 8 Sep 2018, at 15.18, Simone Lazzaris  wrote:
> > Timo, unfortunately the patch doesn't compile;  I've moved the declaration
> > of "conn" one line up to make it work.
> 
> Oops, I guess I was too much in a hurry to even compile it. Here's a new
> patch that compiles and passes our director CI tests.

Hi Timo;
after 24 hours of field testing, I can say that the issue is mostly gone. I say 
"mostly" 
because the service is working as far as the user is concerned, but I see some 
strange 
going on in the logs.

Grepping "director" in the log file, I can see that there are some panic and 
some 
comunication errors:

Sep 11 03:24:55 imap-front4 dovecot: director: doveadm: Host 192.168.1.143 
vhost count 
changed from 100 to 0
Sep 11 03:24:55 imap-front4 dovecot: director: doveadm: Host 192.168.1.219 
vhost count 
changed from 100 to 0
Sep 11 03:24:55 imap-front4 dovecot: director: doveadm: Host 192.168.1.218 
vhost count 
changed from 100 to 0
Sep 11 03:24:55 imap-front4 dovecot: director: doveadm: Host 192.168.1.216 
vhost count 
changed from 100 to 0
Sep 11 03:24:55 imap-front4 dovecot: director: 
director(212.183.164.161:9090/right): Host 
192.168.1.145 vhost count changed from 100 to 0
Sep 11 03:24:55 imap-front4 dovecot: director: doveadm: Host 192.168.1.217 
vhost count 
changed from 100 to 0
Sep 11 03:24:55 imap-front4 dovecot: director: doveadm: Host 192.168.1.144 
vhost count 
changed from 100 to 0
Sep 11 03:24:55 imap-front4 dovecot: director: doveadm: Host 192.168.1.145 
vhost count 
changed from 0 to 0
Sep 11 03:24:55 imap-front4 dovecot: director: doveadm: Host 192.168.1.142 
vhost count 
changed from 100 to 0
Sep 11 03:25:09 imap-front4 dovecot: director: 
director(212.183.164.161:9090/right): Host 
192.168.1.143 vhost count changed from 0 to 100
Sep 11 03:25:09 imap-front4 dovecot: director: Error: Director 
212.183.164.161:9090/right 
disconnected: Connection closed (bytes in=1116368, bytes out=1182555, 0+27319 
USERs 
received, last input 0.000 s ago, last output 0.000 s ago, connected 4602.589 s 
ago, 481 
peak output buffer size, 1.948 CPU secs since connected)
Sep 11 03:25:09 imap-front4 dovecot: director: Connecting to 
212.183.164.161:9090 (as 
212.183.164.164): Reconnecting after disconnection
Sep 11 03:25:09 imap-front4 dovecot: director: Error: Director 
212.183.164.161:9090/out 
disconnected: Connection closed: read(size=968) failed: Connection reset by 
peer (bytes 
in=56, bytes out=59143, 0+0 USERs received, 1556 USERs sent in handshake, last 
input 
0.002 s ago, last output 0.002 s ago, connected 0.024 s ago, 8190 peak output 
buffer size, 
0.004 CPU secs since connected, handshake DONE not received)
Sep 11 03:25:09 imap-front4 dovecot: director: Connecting to 
212.183.164.162:9090 (as 
212.183.164.164): Reconnecting after disconnection
Sep 11 03:25:09 imap-front4 dovecot: director: 
director(212.183.164.162:9090/out): 
Handshake finished in 0.006 secs (bytes in=61, bytes out=59173, 0+0 USERs 
received, 1556 
USERs sent in handshake, last input 0.000 s ago, last output 0.003 s ago, 
connected 0.006 s 
ago, 8190 peak output buffer size, 0.000 CPU secs since connected)
Sep 11 03:25:10 imap-front4 dovecot: director: Connecting to 
212.183.164.161:9090 (as 
212.183.164.164): Received CONNECT request from 212.183.164.162:9090/right - 
replacing 
current right 212.183.164.162:9090/right
Sep 11 03:25:10 imap-front4 dovecot: director: 
director(212.183.164.161:9090/out): 
Handshake finished in 0.004 secs (bytes in=61, bytes out=59332, 0+0 USERs 
received, 1561 
USERs sent in handshake, last input 0.000 s ago, last output 0.004 s ago, 
connected 0.004 s 
ago, 8190 peak output buffer size, 0.000 CPU secs since connected)
Sep 11 03:25:10 imap-front4 dovecot: director: 
director(212.183.164.161:9090/right): Host 
192.168.1.216 vhost count changed from 0 to 100
Sep 11 03:25:10 imap-front4 dovecot: director: Error: Director 
212.183.164.161:9090/right 
disconnected: Connection closed: read(size=558) failed: Connection reset by 
peer (bytes 
in=466, bytes out=60271, 0+6 USERs received, 1561 USERs sent in handshake, last 
input 
0.001 s ago, last output 0.000 s ago, connected 0.553 s ago, 8190 peak output 
buffer size, 
0.000 CPU secs since connected)
Sep 11 03:25:10 imap-front4 dovecot: director: Connecting to 
212.183.164.162:9090 (as 
212.183.164.164): Reconnecting after disconnection
Sep 11 03:25:10 imap-front4 dovecot: director: 
director(212.183.164.162:9090/out): 
Handshake finished in 0.005 secs (bytes in=61, bytes out=59372, 0+0 USERs 
received, 1562 
USERs sent in handshake, last input 0.000 s ago, last output 0.005 s ago, 
connected 0.005 s 
ago, 8192 peak output buffer size, 0.000 CPU secs since connected)
Sep 11 03:25:10 imap-front4 dovecot: director: Connecting to 
212.183.164.161:9090 (as 
212.183.164.164): Received CONNECT request from 212.183.164.162:9090/right - 
replacing 
current right 

Re: Auth process sometimes stop responding after upgrade

2018-09-10 Thread Simone Lazzaris
In data lunedì 10 settembre 2018 09:58:50 CEST, Timo Sirainen ha scritto:
> On 8 Sep 2018, at 15.18, Simone Lazzaris  wrote:
> > Timo, unfortunately the patch doesn't compile;  I've moved the declaration
> > of "conn" one line up to make it work.
> 
> Oops, I guess I was too much in a hurry to even compile it. Here's a new
> patch that compiles and passes our director CI tests.


This one is better :) I've compiled and installed the patched version on one 
VM and it's working. In the next hours, if everything is ok, I'll percolate 
the change on the whole cluster. Let's see if the issue appears again.

The other cluster have been downgraded (saturday night) to 2.2.30.2 and it's 
working fine: I couldn't afford to have issues on that one.




-- 
*Simone Lazzaris*
*Qcom S.p.A.*





Re: Auth process sometimes stop responding after upgrade

2018-09-10 Thread Timo Sirainen
On 8 Sep 2018, at 15.18, Simone Lazzaris  wrote:Timo, unfortunately the patch doesn't compile;  I've moved the declaration of "conn" one line up to make it work. Oops, I guess I was too much in a hurry to even compile it. Here's a new patch that compiles and passes our director CI tests.

diff
Description: Binary data


Re: Auth process sometimes stop responding after upgrade

2018-09-08 Thread Simone Lazzaris
In data venerdì 7 settembre 2018 19:11:24 CEST, Timo Sirainen ha scritto:
> On 7 Sep 2018, at 19.43, Timo Sirainen  wrote:
> > On 7 Sep 2018, at 16.50, Simone Lazzaris mailto:s.lazza...@interactive.eu>> wrote:
> >> Some more information: the issue has just occurred, again on an instance
> >> without the "service_count = 0" configuration directive on pop3-login.
> >> 
> >> I've observed that while the issue is occurring, the director process
> >> goes 100% CPU. I've straced the process. It is seemingly looping:
> >> 
> >> ...
> >> ...
> >> epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP,
> >> {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78,
> >> {0, {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_ADD,
> >> 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320,
> >> u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0,
> >> {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_ADD, 78,
> >> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) =
> >> 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}})
> >> = 0> 
> > Nothing else but these epoll_ctl() calls? So it's gone to some loop where
> > it keeps calling io_add() and io_remove().
> I'm guessing it's because of doveadm command handling issues, since there's
> some weirdness in the code. Although I couldn't figure out exactly why it
> would go to infinite loop there. But attached a patch that may fix it, if
> you're able to test. We haven't noticed such infinite looping in other
> installations or automated director stresstests though..
> >> FD 13 is "anon_inode:[eventpoll]"
> > 
> > What about fd 78? I guess some socket.
> > 
> > Could you also try two more things when it happens again:
> > 
> > ltrace -tt -e '*' -o ltrace.log -p 
> > (My guess this isn't going to be very useful, but just in case it might
> > be..)
> > 
> > gdb -p 
> > bt full
> > quit
> > 
> > Preferably install dovecot-dbg package also so the gdb backtrace output
> > will be better.
> These would still be useful to verify whether I'm even on the right track.

Timo, unfortunately the patch doesn't compile;  I've moved the declaration of 
"conn" one line up to make it work. 

In other words, the function becomes like this:
void doveadm_connections_ring_synced(void)
{
struct doveadm_connection *conn = 
doveadm_ring_sync_pending_connections;
while (doveadm_ring_sync_pending_connections != NULL &&
   conn->dir->ring_synced) {
doveadm_connection_ring_sync_callback_t *callback =
conn->ring_sync_callback;

conn->ring_sync_callback = NULL;
doveadm_connection_ring_sync_list_move(conn);
doveadm_connection_set_io(conn);
io_set_pending(conn->io);
callback(conn);
}
if (conn->dir->ring_synced)
doveadm_connections_continue_reset_cmds();
}

I've tried the patched version but without luck: the director keeps crashing:

Sep  8 14:10:56 imapfront2 dovecot: director: Connecting to 
212.183.164.163:9090 (as 212.183.164.162): Initial connection
Sep  8 14:10:56 imapfront2 dovecot: director: Incoming connection from 
director 212.183.164.161/in
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/
out): Handshake finished in 0.002 secs (bytes in=61, bytes out=473, 0+0 USERs 
received, last input 0.000 s ago, last output 0.000 s ago, connected 0.002 s 
ago, 207 peak output buffer size, 0.000 CPU secs since connected)
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.142 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.143 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.144 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.145 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.216 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.217 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.218 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.219 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Handshake finished in 0.005 secs (bytes in=67951, bytes out=56, 1788+0 USERs 
received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s 
ago, 0 peak output buffer size, 0.004 CPU secs since connected)
Sep  8 14:10:56 imapfront2 dovecot: 

Re: Auth process sometimes stop responding after upgrade

2018-09-08 Thread Simone Lazzaris
In data venerdì 7 settembre 2018 18:43:36 CEST, Timo Sirainen ha scritto:

Hi Timo;
it happened again, this time on a "high-performance" instance (e.g., WITH 
service_count=0)

> On 7 Sep 2018, at 16.50, Simone Lazzaris  wrote:
> > Some more information: the issue has just occurred, again on an instance
> > without the "service_count = 0" configuration directive on pop3-login.
> Could you also try two more things when it happens again:
> 
> ltrace -tt -e '*' -o ltrace.log -p 
> (My guess this isn't going to be very useful, but just in case it might
> be..)

Done; unfortunately, ltrace.log is empty.

> gdb -p 
> bt full
> quit
> 

Here it is:

root@imapfront2:/usr/local/src/dovecot-2.2.36/src# gdb -p 31635
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
.
Attaching to process 31635
Reading symbols from /usr/local/libexec/dovecot/director...done.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libc.so.6...(no debugging 
symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libc.so.6
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libdl.so.2...(no debugging 
symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libdl.so.2
Reading symbols from /lib/i386-linux-gnu/i686/cmov/librt.so.1...(no debugging 
symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/librt.so.1
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0...(no 
debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/i686/cmov/
libthread_db.so.1".
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
0xb76fa428 in __kernel_vsyscall ()
(gdb) bt full
#0  0xb76fa428 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb752929e in epoll_ctl () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
No symbol table info available.
#2  0xb7678aa0 in io_loop_handle_add (io=io@entry=0x815db90) at ioloop-
epoll.c:106
ctx = 0x8144e98
list = 0x814556c
event = {events = 27, data = {ptr = 0x815c160, fd = 135643488, u32 = 
135643488, u64 = 135643488}}
op = 1
first = 
#3  0xb7676a7d in io_add_file (fd=fd@entry=27, 
condition=condition@entry=IO_READ, 
source_filename=source_filename@entry=0x805f7f4 "doveadm-connection.c", 
source_linenum=source_linenum@entry=1070, 
callback=callback@entry=0x8057e10 , 
context=context@entry=0x8166398) at ioloop.c:64
io = 0x815db90
__FUNCTION__ = "io_add_file"
#4  0xb7676bef in io_add (fd=27, condition=condition@entry=IO_READ, 
source_filename=source_filename@entry=0x805f7f4 "doveadm-connection.c", 
source_linenum=source_linenum@entry=1070, 
callback=callback@entry=0x8057e10 , 
context=context@entry=0x8166398) at ioloop.c:87
io = 
__FUNCTION__ = "io_add"
#5  0x08057135 in doveadm_connection_set_io (conn=conn@entry=0x8166398) at 
doveadm-connection.c:1070
No locals.
#6  0x08059528 in doveadm_connections_ring_synced () at doveadm-connection.c:
1171
conn = 0x8166398
callback = 0x8057f90 
#7  0xb76777b7 in io_loop_handle_timeouts_real (ioloop=0x8138678) at ioloop.c:
568
timeout = 0x8165a38
item = 0x8165a38
tv = {tv_sec = 0, tv_usec = 0}
tv_call = {tv_sec = 1536404874, tv_usec = 999375}
t_id = 3
#8  io_loop_handle_timeouts (ioloop=ioloop@entry=0x8138678) at ioloop.c:581
_data_stack_cur_id = 2
#9  0xb7678dd2 in io_loop_handler_run_internal (ioloop=ioloop@entry=0x8138678) 
at ioloop-epoll.c:196
ctx = 0x8144e98
events = 0x0
event = 0x8138678
list = 
io = 
tv = {tv_sec = 0, tv_usec = 0}
events_count = 15
msecs = 0
ret = 0
i = 
j = 
call = 
__FUNCTION__ = "io_loop_handler_run_internal"
#10 0xb7677496 in io_loop_handler_run (ioloop=ioloop@entry=0x8138678) at 
ioloop.c:649
No locals.
---Type  to continue, or q  to quit---
#11 0xb7677658 in io_loop_run (ioloop=0x8138678) at ioloop.c:624
__FUNCTION__ = "io_loop_run"
#12 0xb75f045e in master_service_run (service=0x81385a8, 
callback=callback@entry=0x804d360 ) at master-service.c:719
No locals.
#13 0x0804cf5e in main (argc=1, argv=0x8138300) at main.c:366
set_roots = {0x805f680, 0x0}
test_port = 0
error = 
debug 

Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Timo Sirainen
On 7 Sep 2018, at 19.43, Timo Sirainen  wrote:On 7 Sep 2018, at 16.50, Simone Lazzaris  wrote:Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login. I've observed that while the issue is occurring, the director process goes 100% CPU. I've straced the process. It is seemingly looping: ..epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0Nothing else but these epoll_ctl() calls? So it's gone to some loop where it keeps calling io_add() and io_remove(). I'm guessing it's because of doveadm command handling issues, since there's some weirdness in the code. Although I couldn't figure out exactly why it would go to infinite loop there. But attached a patch that may fix it, if you're able to test. We haven't noticed such infinite looping in other installations or automated director stresstests though..

2417.patch
Description: Binary data
FD 13 is "anon_inode:[eventpoll]"What about fd 78? I guess some socket.Could you also try two more things when it happens again:ltrace -tt -e '*' -o ltrace.log -p (My guess this isn't going to be very useful, but just in case it might be..)gdb -p bt fullquitPreferably install dovecot-dbg package also so the gdb backtrace output will be better.These would still be useful to verify whether I'm even on the right track.

Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Timo Sirainen
On 7 Sep 2018, at 16.50, Simone Lazzaris  wrote:
> 
> Some more information: the issue has just occurred, again on an instance 
> without the "service_count = 0" configuration directive on pop3-login.
>  
> I've observed that while the issue is occurring, the director process goes 
> 100% CPU. I've straced the process. It is seemingly looping:
>  
> ...
> ...
> epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
> {u32=149035320, u64=149035320}}) = 0
> epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
> epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
> {u32=149035320, u64=149035320}}) = 0
> epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
> epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
> {u32=149035320, u64=149035320}}) = 0
> epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0

Nothing else but these epoll_ctl() calls? So it's gone to some loop where it 
keeps calling io_add() and io_remove(). 

> FD 13 is "anon_inode:[eventpoll]"

What about fd 78? I guess some socket.

Could you also try two more things when it happens again:

ltrace -tt -e '*' -o ltrace.log -p 
(My guess this isn't going to be very useful, but just in case it might be..)

gdb -p 
bt full
quit

Preferably install dovecot-dbg package also so the gdb backtrace output will be 
better.



Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Simone Lazzaris
Some more information: the issue has just occurred, again on an instance 
without the 
"service_count = 0" configuration directive on pop3-login.

I've observed that while the issue is occurring, the director process goes 100% 
CPU. I've 
straced the process. It is seemingly looping:

...
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, 
{u32=149035320, u64=149035320}}) = 0
...
...
...

FD 13 is "anon_inode:[eventpoll]"

Killing (with -9) the director process, without stopping or restarting the 
service, restores 
the correct behaviour.



*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook[4]*
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Aki Tuomi
Try tracing auth-worker process instead, as sql uses worker.
---Aki TuomiDovecot oy
 Original message From: Simone Lazzaris 
 Date: 07/09/2018  12:59  (GMT+02:00) To: Sami 
Ketola  Cc: Dovecot Mailing List , 
Aki Tuomi  Subject: Re: Auth process sometimes stop 
responding after upgrade 

In data venerdì 7 settembre 2018 11:20:49 CEST, Sami Ketola ha scritto:
> > On 7 Sep 2018, at 11.25, Simone Lazzaris  wrote:
> > Actually, I have a poolmon script running that should drop vhost count for
> > unresponsive backends; the strage thing is, the backends are NOT
> > unresponsive, they are working as ususal.
> If it's this one https://github.com/brandond/poolmon/blob/master/poolmon
> <https://github.com/brandond/poolmon/blob/master/poolmon> you are running
> and old version of it as the latest version is more compatible with recent
> dovecot releases.
> 
> current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of
> modifying vhost count.
> 
 
Interesting, I'll surely upgrade the script in the next days. Thanks for the 
hint.
 
But the script is surely not the ultimate cause of the failures: the backend 
(and the script itself) are untouched  - and working - since many moons ago. 
 
The only modified entity is dovecot on the frontends. 
 
And even in the event of some (3 out of 8, in this very case) backends marked 
as failed, the authentication on the frontends should work, shouldn't it?
 
I've tried to strace the auth process during the last failure, and this is what 
I've got:
 
Process 2539 attached - interrupt to quit
gettimeofday({1536308480, 998803}, NULL) = 0
epoll_wait(15, 
 
After about 60 seconds, I've aborted the strace and restarted dovecot to avoid 
upsetting customers. Searching for file descriptor #15 in /proc//fd I found 
"anon_inode:[eventpoll]"

-- 
Simone Lazzaris
Responsabile datacenter

Qcom S.p.A.
Via Roggia Vignola, 9 | 24047 Treviglio (BG)
T +39036347905 | D +3903631970352| M +393938111237
simone.lazza...@qcom.it | www.qcom.it

Qcom Official Pages LinkedIn | Facebook

 





Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Simone Lazzaris
In data venerdì 7 settembre 2018 11:20:49 CEST, Sami Ketola ha scritto:
> > On 7 Sep 2018, at 11.25, Simone Lazzaris  wrote:
> > Actually, I have a poolmon script running that should drop vhost count for
> > unresponsive backends; the strage thing is, the backends are NOT
> > unresponsive, they are working as ususal.
> If it's this one https://github.com/brandond/poolmon/blob/master/poolmon
>  you are running
> and old version of it as the latest version is more compatible with recent
> dovecot releases.
> 
> current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of
> modifying vhost count.
> 

Interesting, I'll surely upgrade the script in the next days. Thanks for the 
hint.

But the script is surely not the ultimate cause of the failures: the backend 
(and the script 
itself) are untouched  - and working - since many moons ago. 

The only modified entity is dovecot on the frontends. 

And even in the event of some (3 out of 8, in this very case) backends marked 
as failed, the 
authentication on the frontends should work, shouldn't it?

I've tried to strace the auth process during the last failure, and this is what 
I've got:

Process 2539 attached - interrupt to quit
gettimeofday({1536308480, 998803}, NULL) = 0
epoll_wait(15, 

After about 60 seconds, I've aborted the strace and restarted dovecot to avoid 
upsetting 
customers. Searching for file descriptor #15 in /proc//fd I found 
"anon_inode:
[eventpoll]"


*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook[4]*
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Simone Lazzaris
In data venerdì 7 settembre 2018 10:06:00 CEST, Sami Ketola ha scritto:
> > On 7 Sep 2018, at 11.00, Simone Lazzaris 
> > wrote:
> > 
> > 
> > The only suspect thing is this:
> > 
> > Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host
> > 192.168.1.142
> > vhost count changed from 100 to 0
> > Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host
> > 192.168.1.143
> > vhost count changed from 100 to 0
> > Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host
> > 192.168.1.219
> > vhost count changed from 100 to 0
> > 
> > Nothing on the other system logs (e.g. kernel, daemon, syslog, messages
> > ).
> Any idea what is changing the vhost count on the backends? Do you have some
> script running that possibly does change the vhost count for some
> triggering event?
> 
> Sami

Actually, I have a poolmon script running that should drop vhost count for 
unresponsive 
backends; the strage thing is, the backends are NOT unresponsive, they are 
working as 
ususal.


*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook[4]*
[5] 







[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
[5] https://www.qcom.it/includes/email-banner.gif


Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Sami Ketola


> On 7 Sep 2018, at 11.25, Simone Lazzaris  wrote:
> Actually, I have a poolmon script running that should drop vhost count for 
> unresponsive backends; the strage thing is, the backends are NOT 
> unresponsive, they are working as ususal.
> 

If it's this one https://github.com/brandond/poolmon/blob/master/poolmon 
 
you are running and old version of it as the latest version is more compatible 
with recent dovecot releases.

current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of 
modifying vhost count.

But even then I still consider poolmon a bit too aggressive in marking hosts 
down. It does HOST-DOWN and HOST-FLUSH already after first scan failure. 
Maybe there should be some grace period and wait for few failed polls in 
certain period before doing it in case there is some temporary overload or 
networking failure.

Sami






Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Simone Lazzaris
In data venerdì 7 settembre 2018 10:06:00 CEST, Sami Ketola ha scritto:
> > On 7 Sep 2018, at 11.00, Simone Lazzaris 
> > wrote:
> > 
> > 
> > The only suspect thing is this:
> > 
> > Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host
> > 192.168.1.142
> > vhost count changed from 100 to 0
> > Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host
> > 192.168.1.143
> > vhost count changed from 100 to 0
> > Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host
> > 192.168.1.219
> > vhost count changed from 100 to 0
> > 
> > Nothing on the other system logs (e.g. kernel, daemon, syslog, messages
> > ).
> Any idea what is changing the vhost count on the backends? Do you have some
> script running that possibly does change the vhost count for some
> triggering event?
> 
> Sami

Actually, I have a poolmon script running that should drop vhost count for 
unresponsive backends; the strage thing is, the backends are NOT unresponsive, 
they are working as ususal.


To be honest, the configuration that keep failing (again, right now) has imap-
login not configured as I've told you. The configuration you saw was an 
attempt to overcame the problem, in which I've set pop3-login like this:

service pop3-login {
  executable = imap-login director
  service_count = 0
  vsz_limit = 128 M
}

But the machines that fails are like this:

service pop3-login {
  executable = pop3-login director
}

I've got 4 VM configured in the ring. The one with service_count=0 has (so 
far) never failed, the other 3 has failed multiple time (every few hours).

-- 
*Simone Lazzaris*
*Qcom S.p.A.*





Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Sami Ketola



> On 7 Sep 2018, at 11.00, Simone Lazzaris  wrote:
> 
> 
> The only suspect thing is this:
> 
> Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 
> vhost count changed from 100 to 0
> Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 
> vhost count changed from 100 to 0
> Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 
> vhost count changed from 100 to 0
> 
> Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ).

Any idea what is changing the vhost count on the backends? Do you have some 
script running that possibly does change the vhost count for some triggering 
event?

Sami



Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Simone Lazzaris
In data venerdì 7 settembre 2018 09:39:23 CEST, Aki Tuomi ha scritto:
> Do you see anything else in logs? 
> 

Just the usual proxy messages, like these:

Sep  6 14:45:41 imap-front13 dovecot: imap-login: 
proxy(alessan...@soldinicarrelli.it): disconnecting 5.169.233.224 
(Disconnected by server(0s idle, in=451, out=3556)): 
user=, method=PLAIN, rip=5.169.233.224, 
lip=80.76.81.92, TLS, session=

Sep  6 14:45:51 imap-front13 dovecot: pop3-login: proxy(p...@pedrini.it): 
started proxying to 192.168.1.144:110: user=, method=DIGEST-
MD5, rip=5.34.191.10, lip=80.76.81.92, TLS, session=


The only suspect thing is this:

Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 
vhost count changed from 100 to 0
Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 
vhost count changed from 100 to 0
Sep  6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 
vhost count changed from 100 to 0

Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ).

-- 
*Simone Lazzaris*
*Qcom S.p.A.*





Re: Auth process sometimes stop responding after upgrade

2018-09-07 Thread Aki Tuomi
Do you see anything else in logs? 


---Aki TuomiDovecot oy
 Original message From: Simone Lazzaris 
 Date: 07/09/2018  10:13  (GMT+02:00) To: 
dovecot@dovecot.org Subject: Auth process sometimes stop responding after 
upgrade 
Hi all;
I've upgraded a ring of dovecot directors from 2.2.15 to 2.2.36. After the 
upgrade I've got some instability: a few time per day per server, seemly at 
random, the auth process stop responding and the clients cannot authenticate 
any more:

Sep  6 14:45:51 imap-front13 dovecot: pop3-login: Warning: Auth process not 
responding, delayed sending initial response (greeting): user=<>, 
rip=xx.xx.xx.xx, lip=xx.xx.xx.xx, TLS, session=<1HVqRTN10MNTie+S>

I can't figure it out Any hints?

This is my configuration:


# 2.2.36 (1f10bfa63): /usr/local/etc/dovecot/dovecot.conf
# OS: Linux 3.2.0-4-686-pae i686 Debian 7.11 
# Hostname: imap-front13.mailfarm.interac.it
auth_mechanisms = plain login digest-md5 cram-md5 apop scram-sha-1
auth_verbose = yes
auth_verbose_passwords = plain
auth_worker_max_count = 50
base_dir = /var/run/dovecot/
default_login_user = nobody
director_doveadm_port = 9091
director_mail_servers = 192.168.1.142 192.168.1.143 192.168.1.144 
192.168.1.145 192.168.1.216 192.168.1.217 192.168.1.218 192.168.1.219
director_servers = 212.183.164.157 212.183.164.158 212.183.164.159 
212.183.164.160
listen = *
passdb {
  args = /usr/local/etc/dovecot/sql.conf
  driver = sql
}
protocols = imap pop3
service director {
  fifo_listener login/proxy-notify {
    mode = 0666
  }
  inet_listener {
    port = 9090
  }
  unix_listener director-userdb {
    mode = 0600
  }
  unix_listener login/director {
    mode = 0666
  }
}
service imap-login {
  executable = imap-login director
  service_count = 0
  vsz_limit = 128 M
}
service pop3-login {
  executable = pop3-login director
  service_count = 0
  vsz_limit = 128 M
}
ssl_cert =