hello,

this past weekend our webserver, which serves pages from AFS, crashed and I found several messages like the following in /var/log/messages:

Jun 18 13:19:51 web1 kernel: INFO: task httpd:26383 blocked for more than 120 seconds.
Jun 18 13:19:51 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 18 13:19:51 web1 kernel: httpd D 0001B845 2032 26383 32143 26384 26382 (NOTLB)
Jun 18 13:19:51 web1 kernel: c7449e48 00000082 1e778e40 0001b845 00000046 00000002 f887e080 00000007
Jun 18 13:19:51 web1 kernel: dff56000 1e7e1fb4 0001b845 00069174 00000000 dff5610c c3012900 f3e77740
Jun 18 13:19:51 web1 kernel: f24491e0 00000000 00000000 ea22cb80 00000000 00000040 00000000 ea22cb80
Jun 18 13:19:51 web1 kernel: Call Trace:
Jun 18 13:19:51 web1 kernel:  [<f964f78d>] afs_access+0x320/0x337 [openafs]
Jun 18 13:19:51 web1 kernel:  [<c061d975>] __mutex_lock_slowpath+0x4d/0x7c
Jun 18 13:19:51 web1 kernel:  [<c061d9b3>] .text.lock.mutex+0xf/0x14
Jun 18 13:19:51 web1 kernel:  [<c048219b>] do_lookup+0x7a/0x174
Jun 18 13:19:51 web1 kernel:  [<c0483fc8>] __link_path_walk+0x87a/0xd4b
Jun 18 13:19:51 web1 kernel:  [<c04844d1>] link_path_walk+0x38/0x95
Jun 18 13:20:24 web1 kernel:  [<c0484892>] do_path_lookup+0x219/0x27f
Jun 18 13:20:24 web1 kernel:  [<c0484fec>] __user_walk_fd+0x29/0x3a
Jun 18 13:20:24 web1 kernel:  [<c0474e92>] sys_faccessat+0x93/0x126
Jun 18 13:20:24 web1 kernel:  [<c044bf62>] audit_syscall_entry+0x15a/0x18c
Jun 18 13:20:24 web1 kernel:  [<c0474f34>] sys_access+0xf/0x13
Jun 18 13:20:24 web1 kernel:  [<c0404f17>] syscall_call+0x7/0xb
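
to get a feel for how many processes were affected, a rough python sketch like the one below could tally the hung-task reports and the kernel modules named in the call traces. the function name summarize and both regular expressions are my own and only assume the syslog format shown above; this is not an OpenAFS tool.

```python
import re
from collections import Counter

# Matches the hung-task header line, e.g.
# "... kernel: INFO: task httpd:26383 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"kernel: INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Matches a call-trace frame, e.g.
# "... kernel:  [<f964f78d>] afs_access+0x320/0x337 [openafs]"
# The trailing "[module]" part is optional (absent for core kernel symbols).
FRAME_RE = re.compile(
    r"kernel:\s+\[<[0-9a-f]+>\] (?P<sym>[^\s+]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+)\])?"
)


def summarize(lines):
    """Count hung-task reports per process name and trace frames per module."""
    tasks = Counter()
    modules = Counter()
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            tasks[hung.group("name")] += 1
            continue
        frame = FRAME_RE.search(line)
        if frame and frame.group("mod"):
            modules[frame.group("mod")] += 1
    return tasks, modules
```

it could be run as `tasks, modules = summarize(open("/var/log/messages"))`; many httpd entries next to openafs frames would at least confirm that the blocked processes were all waiting inside the AFS client code.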

this system is 32-bit CentOS 5.5 (so several packages are quite out of date) running OpenAFS 1.4.14. other AFS clients did not have any problems that we are aware of, but this web server is under the heaviest load.

i suspect that the system kept spawning httpd processes as old ones got blocked, and eventually it ran out of memory and became unresponsive. after a reboot it works fine. so the question is, what caused the afs cache manager to respond so slowly?

can anyone confirm whether they have seen kernel messages like this? how can i tell whether the problem is with the client or the server? i see no error messages in BosLog, FileLog, or VolserLog on our servers...

i may need to adjust the afsd or fileserver/volserver arguments.
the client's /etc/sysconfig/openafs:
AFSD_ARGS="-dynroot -fakestat-all -daemons 6 -volumes 500 -chunksize 20 -blocks 5242880"
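
for reference, here is a small python sketch of what those cache settings work out to. it assumes, as i read the afsd man page, that -blocks is counted in 1 KB blocks and that -chunksize is an exponent of 2 giving the chunk size in bytes. the helper names parse_afsd_args and cache_summary are made up for this example.

```python
import shlex


def parse_afsd_args(arg_string):
    """Split an AFSD_ARGS string into {option: value or True for bare flags}."""
    tokens = shlex.split(arg_string)
    opts = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-"):
            # An option takes a value only if the next token is not an option.
            if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
                opts[tok] = tokens[i + 1]
                i += 2
            else:
                opts[tok] = True
                i += 1
        else:
            i += 1
    return opts


def cache_summary(opts):
    """Derive cache and chunk sizes (assumes 1 KB blocks, 2**chunksize bytes)."""
    cache_bytes = int(opts["-blocks"]) * 1024
    chunk_bytes = 2 ** int(opts["-chunksize"])
    return {
        "cache_gib": cache_bytes / 2**30,
        "chunk_mib": chunk_bytes / 2**20,
        # Upper bound on chunks if every chunk were completely full.
        "max_full_chunks": cache_bytes // chunk_bytes,
    }
```

under those assumptions the settings above describe a 5 GiB cache with 1 MiB chunks, i.e. at most 5120 completely full chunks.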

our servers' BosConfig lines for fileserver and volserver:
parm /usr/afs/bin/fileserver -L
parm /usr/afs/bin/volserver -p 128

i saw Russ Allbery's recent message on another thread saying that he uses these parameters on the fileserver, so i can try that:

/usr/lib/openafs/fileserver -L -l 1000 -s 1000 -vc 1000 -cb 200000 \
    -rxpck 800 -udpsize 1048576 -busyat 200 -vattachpar 4

thanks,

--Jonathan




--
[email protected]
Computing Services
School of Social Sciences
SSPA 4110 | 949.824.1536