It's usually very troublesome to try to reproduce and debug these.. It looks like the directors got into some kind of a broken state that wouldn't fix itself. Some host update going around and around in the director ring without being detected as a duplicate/obsolete update. For now I think the only workaround is to do what you did: stop all directors (or maybe a single one could have been left running) and start them up again.
> On 21 Apr 2017, at 18.50, Mark Moseley <[email protected]> wrote: > > Timo/Aki/Docecot guys, any hints here? Is this a bug? Design issue? > > On Fri, Apr 7, 2017 at 10:10 AM Mark Moseley <[email protected]> wrote: > >> On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley <[email protected]> >> wrote: >> >>> We just had a bunch of backend boxes go down due to a DDoS in our >>> director cluster. When the DDoS died down, our director ring was a mess. >>> >>> Each box had thousands (and hundreds per second, which is a bit much) of >>> log lines like the following: >>> >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> >>> This was on every director box and the status of all of the directors in >>> 'doveadm director ring status' was 'handshaking'. >>> >>> Here's a sample packet between directors: >>> >>> 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq >>> 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr >>> 1721203906], length 981 >>> >>> Q. [f.|.HOST 10.1.20.10 9090 1006732 10.1.17.15 >>> 100 D1491260800 >>> HOST 10.1.20.10 9090 1006733 10.1.17.15 100 >>> U1491260800 >>> HOST 10.1.20.10 9090 1006734 10.1.17.15 100 >>> D1491260800 >>> HOST 10.1.20.10 9090 1006735 10.1.17.15 100 >>> U1491260800 >>> HOST 10.1.20.10 9090 1006736 10.1.17.15 100 >>> D1491260800 >>> HOST 10.1.20.10 9090 1006737 10.1.17.15 100 >>> U1491260800 >>> HOST 10.1.20.10 9090 1006738 10.1.17.15 100 >>> D1491260800 >>> HOST 10.1.20.10 9090 1006739 10.1.17.15 100 >>> U1491260800 >>> HOST 10.1.20.10 9090 1006740 10.1.17.15 100 >>> D1491260800 >>> HOST 10.1.20.10 9090 1006741 10.1.17.15 100 >>> U1491260800 >>> HOST 10.1.20.10 9090 1006742 10.1.17.15 100 >>> D1491260800 >>> HOST 10.1.20.10 9090 1006743 10.1.17.15 100 >>> U1491260800 >>> HOST 10.1.20.10 9090 1006744 10.1.17.15 100 >>> D1491260800 >>> HOST 10.1.20.10 9090 1006745 10.1.17.15 100 >>> U1491260800 >>> HOST 10.1.20.10 9090 1006746 10.1.17.15 100 >>> D1491260800 >>> HOST 10.1.20.10 9090 1006747 10.1.17.15 100 >>> U1491260800 >>> SYNC 10.1.20.10 9090 1011840 7 1491263483 3377546382 >>> >>> I'm guessing that D1491260800 is the user hash (with D for down), and the >>> U version is for 'up'. >>> >>> I'm happy to provide the full tcpdump (and/or doveconf -a), though the >>> tcpdump is basically all identical the one I pasted (same hash, same host). >>> >>> This seems pretty fragile. There should be some sort of tie break for >>> that, instead of bringing the entire cluster to its knees. Or just drop the >>> backend host completely. Or something, anything besides hosing things >>> pretty badly. >>> >>> This is 2.2.27, on both the directors and backend. If the answer is >>> upgrade to 2.2.28, then I'll upgrade immediately. I see >>> commit a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but >>> the commit predates 2.2.27 by a month and a half. >>> >>> In the meantime, is there any doveadm command I could've done to fix >>> this? I tried removing the host (doveadm director remove 10.1.17.15) but >>> that didn't do anything. I didn't think to try to flush the mapping for >>> that user till just now. I suspect that with the ring unsync'd, flushing >>> the user wouldn't have helped. >>> >>> The only remedy was to kill dovecot on every box in the director cluster >>> and then (with dovecot down on *all* of them) start dovecot back up. >>> Restarting each director's dovecot (with other directors' dovecots still >>> running) did nothing. Only by brining the entire cluster down did dovecot >>> stop furiously logging "Host ... is being updated before previous update >>> had finished" on every director host. >>> >>> Any advice is most welcome. >>> >> >> >> Dovecot guys, >> >> Any thing I can do to fix this? Any thoughts on that error in general? >> >>
