Hi all,
For the first time, we had OpenAFS 1.6 clients participating in our server
stress test, and it did not go well.
Workload:
- 128 concurrent jobs on 66 client machines
- each job writes four files of ~700 MB, sequentially, using cp
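In essence, each job boils down to the loop below (a scaled-down, self-contained sketch: in the real run the source file is ~700 MB and the destination is a directory in an AFS volume; the paths, file size, and variable names here are placeholders, not our actual harness):

```shell
# Sketch of one stress-test job. SRC stands in for the ~700 MB image and
# DST for the AFS target directory; both are shrunk/replaced so the sketch
# runs standalone.
SRC=$(mktemp)
DST=$(mktemp -d)
dd if=/dev/zero of="$SRC" bs=1024 count=1 2>/dev/null  # stand-in source file

for i in 1 2 3 4; do
    # the four files are written sequentially, with plain cp
    if ! cp "$SRC" "$DST/iso_0001_0000000$i"; then
        echo "failed to write iso_0001_0000000$i"
        break
    fi
done

COUNT=$(ls "$DST" | wc -l)
echo "$COUNT files written"
rm -f "$SRC"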
Server: OpenAFS 1.4.14, Scientific Linux 5.7, amd64, two bonded GbE links
Clients:
- 64 x OpenAFS 1.4.14 (+ gerrit 3286), Scientific Linux 5.7, amd64, cache on
ext3
- 2 x OpenAFS 1.6.0 (unpatched), Scientific Linux 6.1, amd64, cache on ext4
(kernel 2.6.32-131.17.1.el6, module was built against GA kernel
2.6.32-71.el6, kABI deps are met)
Results:
- the 126 jobs running on the 64 1.4.14 clients finished successfully
- the two jobs on the two 1.6.0 clients timed out while writing the first
file, and both clients are now stuck
Error messages from client 2 (695 MB of the first file made it to the server):
cp: writing `iso_0001_00000001': No such file or directory
cp: closing `iso_0001_00000001': Connection timed out
failed to write iso_0001_00000001
Corresponding syslog messages:
10:02:11 Lost contact with file server
10:02:11 Lost contact with file server
10:02:12 failed to store file (110)
10:02:22 file server is back up
10:02:22 file server is back up
10:02:32 Lost contact with file server
10:02:32 Lost contact with file server
10:02:52 file server is back up
10:02:52 file server is back up
Error messages from client 1 (335 MB of the first file made it to the server):
cp: closing `iso_0002_00000001': Connection timed out
failed to write iso_0002_00000001
Corresponding syslog messages:
09:56:49 Lost contact with file server
09:56:49 Lost contact with file server
09:56:53 blade80 kernel: afs: failed to store file (110)
09:57:18 file server is back up
09:57:18 file server is back up
Alt-SysRq-T logs this for a hanging /bin/ls /afs:
ls S 0000000000000007 0 5518 5333 0x00000080
ffff88041afdbc08 0000000000000082 0000000000000000 ffffffff8111fb61
ffff880000029d80 ffff880400000002 ffff88041afdbc48 00000001681563f8
ffff8804169c3af8 ffff88041afdbfd8 000000000000f598 ffff8804169c3af8
Call Trace:
[<ffffffff8111fb61>] ? __alloc_pages_nodemask+0x111/0x8b0
[<ffffffffa0403c16>] afs_osi_SleepSig+0xf6/0x1a0 [openafs]
[<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
[<ffffffffa0403d31>] afs_osi_Sleep+0x71/0xc0 [openafs]
[<ffffffffa0415a43>] afs_GetDCache+0x1923/0x2260 [openafs]
[<ffffffff811377b7>] ? handle_pte_fault+0xf7/0xb50
[<ffffffffa045af19>] afs_linux_readdir+0x169/0xa60 [openafs]
[<ffffffff811860f0>] ? filldir+0x0/0xe0
[<ffffffff811860f0>] ? filldir+0x0/0xe0
[<ffffffff81186370>] vfs_readdir+0xc0/0xe0
[<ffffffff811864f9>] sys_getdents+0x89/0xf0
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
There are also many traces like the following, from a system maintenance task
scheduled by cron:
perl S 0000000000000004 0 2607 2606 0x00000080
ffff88041704b718 0000000000000086 0000000000000000 0000000000012b40
000000001704a000 ffff88041704b6a8 ffffffff8112eff9 000000016707a731
ffff88041cbd1af8 ffff88041704bfd8 000000000000f598 ffff88041cbd1af8
Call Trace:
[<ffffffff8112eff9>] ? zone_statistics+0x99/0xc0
[<ffffffffa0403c16>] afs_osi_SleepSig+0xf6/0x1a0 [openafs]
[<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
[<ffffffffa0403d31>] afs_osi_Sleep+0x71/0xc0 [openafs]
[<ffffffffa0415a43>] afs_GetDCache+0x1923/0x2260 [openafs]
[<ffffffffa043cdaf>] afs_lookup+0x3df/0x1cf0 [openafs]
[<ffffffffa042ce4f>] ? afs_FindVCache+0x2ff/0x5b0 [openafs]
[<ffffffffa0430f0e>] ? afs_GetVCache+0x7e/0x590 [openafs]
[<ffffffffa045dc5a>] afs_linux_dentry_revalidate+0x3ba/0x540 [openafs]
[<ffffffffa042cc32>] ? afs_FindVCache+0xe2/0x5b0 [openafs]
[<ffffffff814dc34e>] ? mutex_lock+0x1e/0x50
[<ffffffffa042d269>] ? afs_PutVCache+0x79/0x140 [openafs]
[<ffffffff8118a7cd>] ? __d_lookup+0xad/0x150
[<ffffffff811806aa>] do_lookup+0x5a/0x220
[<ffffffff81180dd9>] __link_path_walk+0x569/0x820
[<ffffffff81181228>] do_follow_link+0x198/0x440
[<ffffffff81180a7b>] __link_path_walk+0x20b/0x820
[<ffffffff8118175a>] path_walk+0x6a/0xe0
[<ffffffff8118192b>] do_path_lookup+0x5b/0xa0
[<ffffffff81173e61>] ? get_empty_filp+0xa1/0x170
[<ffffffff8118259b>] do_filp_open+0xfb/0xd90
[<ffffffff81041594>] ? __do_page_fault+0x1e4/0x480
[<ffffffff811911f0>] ? mntput_no_expire+0x30/0x110
[<ffffffff8118f1e2>] ? alloc_fd+0x92/0x160
[<ffffffff8116f989>] do_sys_open+0x69/0x140
[<ffffffff8116faa0>] sys_open+0x20/0x30
[<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Sometimes afs_osi_SleepSig is the first line in the call trace, but in most
cases it is preceded by a "? zone_statistics" frame, as above.
The client still responds to rxdebug and cmdebug.
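For the record, the liveness checks are done with the standard debugging tools
(blade80 is one of the stuck clients; 7001 is the cache manager's default rx
port):

```shell
# Query rx statistics from the stuck client's cache manager:
rxdebug blade80 7001

# Dump the cache manager's internal lock state:
cmdebug blade80
```

Both still answer promptly, so rx itself is alive and only the VFS-level
operations hang.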
This is probably easy to reproduce in our environment, but I'm leaving the
clients in this state for the moment, just in case there's more data to collect.
Any ideas?
Thanks,
Stephan
bcc: to openafs-bugs
--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info