It is possible that I'm seeing the same problem. Our AMD Opteron 4386 (16
cores) machine is also getting stuck with lots of hung tasks.
Although it responds to ping, and even a KVM virtual machine running on it
appears to continue working correctly, the host itself is locked up. This
happens once a week, probably when the machine is under its heaviest
direct CPU load and NFS load.
Once the machine is in this state I can type in a username at the login
prompt but no password prompt ever appears.
I forced a crashdump and it contained hundreds of tasks with backtraces
involving a mutex_lock in walk_component or nfsd_lookup_dentry which look
similar to Alexander's:
PID: 499 TASK: ffff880490a29080 CPU: 11 COMMAND: "nrpe"
#0 [ffff880454e099a8] __schedule at ffffffff8134f195
#1 [ffff880454e09a30] __mutex_lock_common.isra.5 at ffffffff8134fb74
#2 [ffff880454e09aa0] mutex_lock at ffffffff8134fa62
#3 [ffff880454e09ac0] walk_component at ffffffff81103868
#4 [ffff880454e09b30] link_path_walk at ffffffff811040c1
#5 [ffff880454e09bc0] path_openat at ffffffff8110611d
#6 [ffff880454e09c50] do_filp_open at ffffffff8110646d
#7 [ffff880454e09d20] open_exec at ffffffff810fed80
#8 [ffff880454e09d40] load_elf_binary at ffffffff81135939
#9 [ffff880454e09e50] search_binary_handler at ffffffff810ff7fd
#10 [ffff880454e09ea0] do_execve_common.isra.24 at ffffffff81100551
#11 [ffff880454e09f10] sys_execve at ffffffff81014dd2
#12 [ffff880454e09f50] stub_execve at ffffffff813559ec
RIP: 00007fcc8991ca87 RSP: 00007fffe8b91ef8 RFLAGS: 00000246
RAX: 000000000000003b RBX: 0000000000000003 RCX: ffffffffffffffff
RDX: 000000000164d180 RSI: 00007fffe8b91f10 RDI: 00007fcc899bc3ad
RBP: 0000000000000003 R8: 0000000000000000 R9: 00000000000001f2
R10: 00007fcc8a88f9d0 R11: 0000000000000246 R12: 00007fffe8b91f10
R13: 0000000000000400 R14: 0000000000000001 R15: 00007fffe8b91f10
ORIG_RAX: 000000000000003b CS: 0033 SS: 002b
and:
PID: 4087 TASK: ffff88040ea63840 CPU: 2 COMMAND: "nfsd"
#0 [ffff8804034b9c00] __schedule at ffffffff8134f195
#1 [ffff8804034b9c88] __mutex_lock_common.isra.5 at ffffffff8134fb74
#2 [ffff8804034b9cf8] mutex_lock at ffffffff8134fa62
#3 [ffff8804034b9d18] fh_lock_nested.isra.6 at ffffffffa043d63c [nfsd]
#4 [ffff8804034b9d28] nfsd_lookup_dentry at ffffffffa043df1a [nfsd]
#5 [ffff8804034b9d98] nfsd4_secinfo.part.15 at ffffffffa0447692 [nfsd]
#6 [ffff8804034b9dc8] nfsd4_proc_compound at ffffffffa04468d6 [nfsd]
#7 [ffff8804034b9e18] nfsd_dispatch at ffffffffa043a7cd [nfsd]
#8 [ffff8804034b9e48] svc_process_common at ffffffffa0336c3f [sunrpc]
#9 [ffff8804034b9eb8] svc_process at ffffffffa0337050 [sunrpc]
#10 [ffff8804034b9ed8] nfsd at ffffffffa043a0e3 [nfsd]
#11 [ffff8804034b9ef8] kthread at ffffffff8105f701
#12 [ffff8804034b9f48] kernel_thread_helper at ffffffff813576f4
and:
PID: 5013 TASK: ffff880805c8b180 CPU: 8 COMMAND: "getty"
#0 [ffff88080cb8b9a8] __schedule at ffffffff8134f195
#1 [ffff88080cb8ba30] __mutex_lock_common.isra.5 at ffffffff8134fb74
#2 [ffff88080cb8baa0] mutex_lock at ffffffff8134fa62
#3 [ffff88080cb8bac0] walk_component at ffffffff81103868
#4 [ffff88080cb8bb30] link_path_walk at ffffffff811040c1
#5 [ffff88080cb8bbc0] path_openat at ffffffff8110611d
#6 [ffff88080cb8bc50] do_filp_open at ffffffff8110646d
#7 [ffff88080cb8bd20] open_exec at ffffffff810fed80
#8 [ffff88080cb8bd40] load_elf_binary at ffffffff81135939
#9 [ffff88080cb8be50] search_binary_handler at ffffffff810ff7fd
#10 [ffff88080cb8bea0] do_execve_common.isra.24 at ffffffff81100551
#11 [ffff88080cb8bf10] sys_execve at ffffffff81014dd2
#12 [ffff88080cb8bf50] stub_execve at ffffffff813559ec
RIP: 00007f0d1ed74a87 RSP: 00007fffab157528 RFLAGS: 00000206
RAX: 000000000000003b RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 00007fffab159ee8 RSI: 00007fffab157600 RDI: 0000000000405d7c
RBP: 0000000000000003 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: 00000000006075a0
R13: 00000000011da750 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 000000000000003b CS: 0033 SS: 002b
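For anyone who wants to compare their own dump against these traces, here is a rough sketch of how the blocked tasks could be tallied by whichever function called mutex_lock. It assumes the backtraces have been saved as text in the layout shown above (a "PID: ... COMMAND:" header followed by "#N [addr] symbol at addr" frames). The names parse_backtraces, mutex_callers and strip_suffix are made up for this illustration and are not part of the crash utility.

```python
import re
from collections import Counter

# Task header as printed above, e.g.
# PID: 499 TASK: ffff880490a29080 CPU: 11 COMMAND: "nrpe"
HEADER = re.compile(
    r'^\s*PID:\s*(\d+)\s+TASK:\s*\S+\s+CPU:\s*\d+\s+COMMAND:\s*"([^"]*)"'
)
# Stack frame as printed above, e.g.
# #3 [ffff880454e09ac0] walk_component at ffffffff81103868
FRAME = re.compile(r'^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+')


def strip_suffix(symbol):
    """Drop compiler clone suffixes such as .isra.5 or .part.15."""
    return symbol.split(".", 1)[0]


def parse_backtraces(text):
    """Yield (pid, command, [function names]) for each task in the text."""
    pid = command = None
    frames = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            if pid is not None:
                yield pid, command, frames
            pid, command, frames = int(header.group(1)), header.group(2), []
            continue
        frame = FRAME.match(line)
        if frame and pid is not None:
            frames.append(strip_suffix(frame.group(2)))
    if pid is not None:
        yield pid, command, frames


def mutex_callers(text):
    """Count tasks by the function in the frame just above mutex_lock."""
    counts = Counter()
    for _pid, _command, frames in parse_backtraces(text):
        if "mutex_lock" not in frames:
            continue
        caller_index = frames.index("mutex_lock") + 1
        if caller_index < len(frames):
            counts[frames[caller_index]] += 1
    return counts
```

Calling mutex_callers() on the saved text and printing the result of .most_common() gives a quick count per caller. Note that the nfsd threads are counted under fh_lock_nested rather than nfsd_lookup_dentry, because only the frame directly above mutex_lock is considered.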
Installed packages (dpkg -l):
ii linux-image-amd64 3.2+46
ii nfs-kernel-server 1:1.2.6-4
Mike.