A warning to those considering an upgrade to Debian 10 (buster): we have seen
occasional NFS hangs with Dovecot when using the stock Debian buster kernel
(4.19.67-2+deb10u1).
When we downgrade to the Debian stretch kernel (4.9.189-3+deb9u1), the issue
does not occur. Note that we *only* downgraded the kernel; the rest of the OS
is still Debian buster, running Dovecot 2.3.8.
A little more info: we have a Dovecot cluster, using mdbox for storage, on an
NFS server (NetApp Cmode version 9.6P2). We use a Dovecot director layer, so a
user is always connected to the same back-end Dovecot server. The NFS hang
occurs on the back-end server.
Once the process hangs, other processes trying to write to the same mailbox
get an error like this:
Timeout (180s) while waiting for lock for transaction log file
/var/mail/.../index/storage/dovecot.map.index.log (WRITE lock held by pid )
The stuck process itself doesn't seem to do anything: it sits in "D"
(uninterruptible sleep) state, and "strace" shows nothing (after attaching,
strace itself needs a kill -KILL to stop). The only way to unwedge things seems
to be a kill -KILL of the stuck process.
Reading from the mailbox is still possible.
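
For anyone who wants to check their own back-ends for this symptom, here is a
minimal sketch that lists processes currently in "D" state by reading
/proc/<pid>/stat, plus the kernel wait channel from /proc/<pid>/wchan where it
is readable. The function names are made up for illustration; this is not a
tool we used, and a process can be in "D" state briefly during normal I/O, so
only entries that stay there are interesting.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def read_wchan(pid, proc_root="/proc"):
    """Return the kernel wait channel for pid, or '' if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "wchan")) as fh:
            return fh.read().strip()
    except OSError:
        return ""


def find_d_state(proc_root="/proc"):
    """List (pid, comm, wchan) for processes in uninterruptible sleep."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs here (not Linux, or not mounted)
    stuck = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were scanning
        if state == "D":
            stuck.append((pid, comm, read_wchan(pid, proc_root)))
    return stuck


if __name__ == "__main__":
    for pid, comm, wchan in find_d_state():
        print(f"{pid}\t{comm}\t{wchan}")
```

Running it a few times a minute apart and looking for the same pid showing up
repeatedly is one way to tell a wedged process from ordinary disk wait.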
We are in the process of contacting the linux-nfs folks about this, but I'm
trying to reproduce this on a test cluster first, to be able to present a
well-documented case. Since the hang doesn't happen immediately, but takes a
few hours to a day to occur in the wild, or a few thousand writes to the same
mailbox, it's a bit hard to debug.
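
As a rough illustration of what "a few thousand writes to the same mailbox"
could look like on a test cluster, here is a sketch that appends small messages
to one mailbox over IMAP using Python's imaplib. This is an assumption about
how to generate write load, not our actual test setup: the host, account and
function names are hypothetical, and real traffic (LMTP delivery, concurrent
sessions, flag changes) may be needed to trigger the hang.

```python
import imaplib
import time
from email.message import EmailMessage


def build_message(seq, sender="test@example.com", rcpt="test@example.com"):
    """Build a small RFC 5322 message as bytes; addresses are placeholders."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = rcpt
    msg["Subject"] = f"nfs-hang test message {seq}"
    msg.set_content(f"Test write number {seq}\n")
    return msg.as_bytes()


def hammer_mailbox(conn, count, mailbox="INBOX", delay=0.0):
    """Append `count` messages; return how many succeeded before a failure."""
    done = 0
    for seq in range(count):
        try:
            typ, _ = conn.append(mailbox, None, time.time(), build_message(seq))
        except (imaplib.IMAP4.error, OSError):
            break  # server error or socket timeout: stop and report progress
        if typ != "OK":
            break
        done += 1
        if delay:
            time.sleep(delay)
    return done


def run(host, user, password, count=5000):
    """Connect to a (hypothetical) test server and generate write load."""
    # The timeout argument needs Python 3.9+; it keeps a hung server from
    # blocking this client forever.
    conn = imaplib.IMAP4_SSL(host, timeout=300)
    try:
        conn.login(user, password)
        return hammer_mailbox(conn, count)
    finally:
        try:
            conn.logout()
        except (imaplib.IMAP4.error, OSError):
            pass
```

Calling run() with the test server's hostname and a test account returns the
number of messages written; a result lower than the requested count points at
the write where things went wrong.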
--
Jan-Pieter Cornet
Systeembeheer XS4ALL Internet bv
www.xs4all.nl