Re: warning: NFS hangs with dovecot 2.3.8 on Debian buster

2019-11-01 Thread Jan-Pieter Cornet via dovecot

On 25-10-19 19:41, Jan-Pieter Cornet via dovecot wrote:

> We are in the process of contacting the linux-nfs folks about this, but I'm
> trying to reproduce this on a test-cluster first, to be able to present a
> well-documented case. Since this hang doesn't happen immediately, but takes a
> few hours to a day to occur in the wild, or a few thousand writes to the same
> mailbox, it's a bit hard to debug.


Just FTR, I finally sent mail to the linux-nfs list about this. See e.g.
https://marc.info/?l=linux-nfs&m=157260601632323&w=2

No replies yet - if^H^Hwhen this gets resolved I'll report back to this list.

--
Jan-Pieter Cornet 
Systeembeheer XS4ALL Internet bv
www.xs4all.nl






warning: NFS hangs with dovecot 2.3.8 on Debian buster

2019-10-25 Thread Jan-Pieter Cornet via dovecot

A warning to those considering an upgrade to Debian 10 (buster): we have seen
occasional NFS hangs with dovecot when using the stock Debian buster kernel
(4.19.67-2+deb10u1).

When we downgrade to the Debian stretch kernel (4.9.189-3+deb9u1), the issue
does not occur. Note that we *only* downgraded the kernel; the rest of the OS
is still Debian buster, running Dovecot 2.3.8.

A little more info: we have a dovecot cluster, using mdbox for storage, on an 
NFS server (netapp Cmode version 9.6P2). We use a dovecot director layer, so a 
user is always connected to the same back-end dovecot server. The NFS hang 
occurs on the back-end server.

Once the process hangs, other processes trying to write to the same mailbox
will get an error like this:

Timeout (180s) while waiting for lock for transaction log file 
/var/mail/.../index/storage/dovecot.map.index.log (WRITE lock held by pid )
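To find the lock holder from the logs, something like the following Python
sketch could pull the fields out of such a line. The regular expression is an
assumption based only on the message quoted above (taken as one unwrapped log
line), and the path and pid in the usage example are made-up placeholders, not
values from our systems.

```python
import re

# Pattern modelled on the error line quoted above; an assumption, not a
# format taken from the Dovecot sources.
LOCK_TIMEOUT_RE = re.compile(
    r"Timeout \((?P<seconds>\d+)s\) while waiting for lock for "
    r"transaction log file (?P<path>\S+) "
    r"\((?P<mode>\w+) lock held by pid (?P<pid>\d+)\)"
)


def parse_lock_timeout(line):
    """Return the timeout, log path, lock mode and holder pid, or None."""
    match = LOCK_TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {
        "seconds": int(match.group("seconds")),
        "path": match.group("path"),
        "mode": match.group("mode"),
        "pid": int(match.group("pid")),
    }


if __name__ == "__main__":
    # Hypothetical sample line: the path and pid are placeholders.
    sample = (
        "Timeout (180s) while waiting for lock for transaction log file "
        "/tmp/example/dovecot.map.index.log (WRITE lock held by pid 4242)"
    )
    print(parse_lock_timeout(sample))
```

The pid it returns is the process to inspect; lines that do not match the
pattern simply yield None.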

The stuck process itself doesn't seem to do anything: it sits in "D" (disk
wait) state, and "strace" doesn't show anything (after attaching, strace itself
needs a kill -KILL to stop). The only way to unwedge the process seems to be a
kill -KILL of the stuck process. Reading from the mailbox is still possible.
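For anyone wanting to check their own back-ends, a minimal Python sketch for
spotting processes in "D" state is below. It reads /proc/<pid>/stat on Linux;
the function names are only illustrative. Note that a process can be in "D"
briefly during normal I/O, so only one that stays there is suspect.

```python
import os


def parse_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so cut at the first '(' and the last ')'.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx].strip())
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state


def find_blocked(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or stat unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in find_blocked():
        print(pid, comm)
```

Running this a few times some seconds apart and keeping only the pids that
show up every time gives a rough list of candidates.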

We are in the process of contacting the linux-nfs folks about this, but I'm 
trying to reproduce this on a test-cluster first, to be able to present a 
well-documented case. Since this hang doesn't happen immediately, but takes a 
few hours to a day to occur in the wild, or a few thousand writes to the same 
mailbox, it's a bit hard to debug.

--
Jan-Pieter Cornet 
Systeembeheer XS4ALL Internet bv
www.xs4all.nl



