Hi all,
For the first time, we had OpenAFS 1.6 clients participating in our server
stress test, and it did not go well.
Workload:
- 128 concurrent jobs on 66 client machines
- each job writes four files of ~700 MB, sequentially, using cp
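In essence, each job boils down to the loop below (a scaled-down, self-contained sketch: in the real run the source file is ~700 MB and the destination is a directory in an AFS volume; the paths, file size, and variable names here are placeholders, not our actual harness):

```shell
# Sketch of one stress-test job. SRC stands in for the ~700 MB image and
# DST for the AFS target directory; both are shrunk/replaced so the sketch
# runs standalone.
SRC=$(mktemp)
DST=$(mktemp -d)
dd if=/dev/zero of="$SRC" bs=1024 count=1 2>/dev/null  # stand-in source file

for i in 1 2 3 4; do
    # the four files are written sequentially, with plain cp
    if ! cp "$SRC" "$DST/iso_0001_0000000$i"; then
        echo "failed to write iso_0001_0000000$i"
        break
    fi
done

COUNT=$(ls "$DST" | wc -l)
echo "$COUNT files written"
rm -f "$SRC"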
Server: OpenAFS 1.4.14, Scientific Linux 5.7, amd64, two bonded GbE links
Clients:
- 64 x OpenAFS 1.4.14 (+ gerrit 3286), Scientific Linux 5.7, amd64, cache on
ext3
- 2 x OpenAFS 1.6.0 (unpatched), Scientific Linux 6.1, amd64, cache on ext4
(kernel 2.6.32-131.17.1.el6, module was built against GA kernel
2.6.32-71.el6, kABI deps are met)
Results:
- the 126 jobs running on the 64 1.4.14 clients finished successfully
- the two jobs on the two 1.6.0 clients timed out while writing the first
file, and both clients are now stuck
Error messages from client 2 (695 MB of the first file made it to the server):
cp: writing `iso_0001_00000001': No such file or directory
cp: closing `iso_0001_00000001': Connection timed out
failed to write iso_0001_00000001
Corresponding syslog messages:
10:02:11 Lost contact with file server
10:02:11 Lost contact with file server
10:02:12 failed to store file (110)
10:02:22 file server is back up
10:02:22 file server is back up
10:02:32 Lost contact with file server
10:02:32 Lost contact with file server
10:02:52 file server is back up
10:02:52 file server is back up
Error messages from client 1 (335 MB of the first file made it to the server):
cp: closing `iso_0002_00000001': Connection timed out
failed to write iso_0002_00000001
Corresponding syslog messages:
09:56:49 Lost contact with file server
09:56:49 Lost contact with file server
09:56:53 blade80 kernel: afs: failed to store file (110)
09:57:18 file server is back up
09:57:18 file server is back up
Alt-SysRq-T logs this for a hanging /bin/ls /afs:
ls S 0000000000000007 0 5518 5333 0x00000080
ffff88041afdbc08 0000000000000082 0000000000000000 ffffffff8111fb61
ffff880000029d80 ffff880400000002 ffff88041afdbc48 00000001681563f8
ffff8804169c3af8 ffff88041afdbfd8 000000000000f598 ffff8804169c3af8
Call Trace:
[<ffffffff8111fb61>] ? __alloc_pages_nodemask+0x111/0x8b0
[<ffffffffa0403c16>] afs_osi_SleepSig+0xf6/0x1a0 [openafs]
[<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
[<ffffffffa0403d31>] afs_osi_Sleep+0x71/0xc0 [openafs]
[<ffffffffa0415a43>] afs_GetDCache+0x1923/0x2260 [openafs]
[<ffffffff811377b7>] ? handle_pte_fault+0xf7/0xb50
[<ffffffffa045af19>] afs_linux_readdir+0x169/0xa60 [openafs]
[<ffffffff811860f0>] ? filldir+0x0/0xe0
[<ffffffff811860f0>] ? filldir+0x0/0xe0
[<ffffffff81186370>] vfs_readdir+0xc0/0xe0
[<ffffffff811864f9>] sys_getdents+0x89/0xf0
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
There are also many traces like the following, from a system maintenance task
scheduled by cron:
perl S 0000000000000004 0 2607 2606 0x00000080
ffff88041704b718 0000000000000086 0000000000000000 0000000000012b40
000000001704a000 ffff88041704b6a8 ffffffff8112eff9 000000016707a731
ffff88041cbd1af8 ffff88041704bfd8 000000000000f598 ffff88041cbd1af8
Call Trace:
[<ffffffff8112eff9>] ? zone_statistics+0x99/0xc0
[<ffffffffa0403c16>] afs_osi_SleepSig+0xf6/0x1a0 [openafs]
[<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
[<ffffffffa0403d31>] afs_osi_Sleep+0x71/0xc0 [openafs]
[<ffffffffa0415a43>] afs_GetDCache+0x1923/0x2260 [openafs]
[<ffffffffa043cdaf>] afs_lookup+0x3df/0x1cf0 [openafs]
[<ffffffffa042ce4f>] ? afs_FindVCache+0x2ff/0x5b0 [openafs]
[<ffffffffa0430f0e>] ? afs_GetVCache+0x7e/0x590 [openafs]
[<ffffffffa045dc5a>] afs_linux_dentry_revalidate+0x3ba/0x540 [openafs]
[<ffffffffa042cc32>] ? afs_FindVCache+0xe2/0x5b0 [openafs]
[<ffffffff814dc34e>] ? mutex_lock+0x1e/0x50
[<ffffffffa042d269>] ? afs_PutVCache+0x79/0x140 [openafs]
[<ffffffff8118a7cd>] ? __d_lookup+0xad/0x150
[<ffffffff811806aa>] do_lookup+0x5a/0x220
[<ffffffff81180dd9>] __link_path_walk+0x569/0x820
[<ffffffff81181228>] do_follow_link+0x198/0x440
[<ffffffff81180a7b>] __link_path_walk+0x20b/0x820
[<ffffffff8118175a>] path_walk+0x6a/0xe0
[<ffffffff8118192b>] do_path_lookup+0x5b/0xa0
[<ffffffff81173e61>] ? get_empty_filp+0xa1/0x170
[<ffffffff8118259b>] do_filp_open+0xfb/0xd90
[<ffffffff81041594>] ? __do_page_fault+0x1e4/0x480
[<ffffffff811911f0>] ? mntput_no_expire+0x30/0x110
[<ffffffff8118f1e2>] ? alloc_fd+0x92/0x160
[<ffffffff8116f989>] do_sys_open+0x69/0x140
[<ffffffff8116faa0>] sys_open+0x20/0x30
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Sometimes afs_osi_SleepSig is the first line in the call trace, but in most
cases it is preceded by a "? zone_statistics" frame, as above.
The client still responds to rxdebug and cmdebug.
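For the record, the liveness checks are done with the standard debugging tools
(blade80 is one of the stuck clients; 7001 is the cache manager's default rx
port):

```shell
# Query rx statistics from the stuck client's cache manager:
rxdebug blade80 7001

# Dump the cache manager's internal lock state:
cmdebug blade80
```

Both still answer promptly, so rx itself is alive and only the VFS-level
operations hang.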
This is probably easy to reproduce in our environment, but I'm leaving the
clients in this state for the moment, just in case there's more data to collect.
Any ideas?
Thanks,
Stephan
bcc: to openafs-bugs
--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info