Hi Francesco,

On Wed, Jan 22, 2025 at 09:56:26AM +0100, Salvatore Bonaccorso wrote:
> Control: tags -1 + unreproducible moreinfo
>
> On Wed, Jan 22, 2025 at 12:29:12AM +0100, Francesco Poli (wintermute) wrote:
> > Package: nfs-kernel-server
> > Version: 1:2.8.2-1+b1
> > Severity: grave
> > Justification: causes non-serious data loss
> > X-Debbugs-Cc: [email protected]
> >
> >
> > Dear maintainers,
> > I encountered a big issue, while upgrading package 'nfs-kernel-server'
> > on the box where the NFS server runs (the clients run on the compute
> > nodes of an HPC cluster).
> >
> > The upgrade:
> >
> >   [UPGRADE] nfs-kernel-server:amd64 1:2.8.2-1 -> 1:2.8.2-1+b1
> >
> > got stuck at
> >
> >   [...]
> >   Setting up nfs-kernel-server (1:2.8.2-1+b1) ...
> >
> >
> >
> > It looks like it was stuck at the restart of the systemd service:
> >
> >   # systemctl status nfs-kernel-server.service
> >   ● nfs-server.service - NFS server and services
> >        Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled; prese>
> >       Drop-In: /run/systemd/generator/nfs-server.service.d
> >                └─order-with-mounts.conf
> >        Active: activating (start-pre) since Tue 2025-01-21 12:40:52 CET; 10min ago
> >           Job: 97667
> >    Invocation: ced460d410fe4059b9e8781b35340d70
> >          Docs: man:rpc.nfsd(8)
> >                man:exportfs(8)
> >     Cntrl PID: 249039 (exportfs)
> >         Tasks: 3 (limit: 154102)
> >        Memory: 680K (peak: 2.5M)
> >           CPU: 10ms
> >        CGroup: /system.slice/nfs-server.service
> >                ├─239857 /usr/sbin/nfsdctl threads 0
> >                ├─239918 /usr/sbin/exportfs -au
> >                └─249039 /usr/sbin/exportfs -r
> >
> > There was a 'nfsdctl' process in uninterruptible sleep (D):
> >
> >   $ ps -eldaf | grep nf[s]
> >   4 D root 239857 1 0 80 0 - 847 - 12:07 ? 00:00:00 /usr/sbin/nfsdctl threads 0
> >   5 S root 247511 1 0 80 0 - 1375 - 12:35 ? 00:00:00 /usr/sbin/nfsdcld
> >
> > After about 30 min, since trying to kill PID 239857 obviously had no effect,
> > and I could not find any other strategy to restart
> > nfs-kernel-server.service,
> > I had to reboot the box, thus causing many problems to all the NFS clients.
> >
> > After reboot, I could issue:
> >
> >   # aptitude --purge-unused safe-upgrade
> >
> > which finally completed the upgrade (fixing the nfs-kernel-server package,
> > which was left in a partially configured state).
> >
> >
> > I have never seen anything like this before, and I have upgraded
> > nfs-kernel-server and related packages on Debian machines for quite
> > a long time.
> > Anyway, this should *not* happen during a system upgrade with
> > aptitude or apt!
> >
> > I don't know whether bug [#992661] is related or not.
> >
> > [#992661]: <https://bugs.debian.org/992661>
> >
> > By looking at /var/log/kern.log , I see that a kernel BUG was traced
> > at the time when the 'nfsdctl' process got stuck in D state.
> > See the attached kern.log snippet.
> >
> > Please investigate and fix the issue as soon as possible.
> > I really hope we can prevent this from happening again!
> >
> > Thanks for your time and dedication.
>
> So I'm not able to reproduce this on a current Debian unstable system
> mimicking the upgrade. *But* it is possible we have some races
> somewhere as recently discussed at our regular kernel team meeting.
>
> We need first to find a way to trigger the issue in any case.

Upstream has an idea of what the problem is and has posted a patch:

https://lore.kernel.org/linux-nfs/[email protected]/

Regards,
Salvatore

