Hello again!

On 3/13/14 11:01 AM, "[email protected]" <[email protected]> wrote:
>Message: 1
>To: [email protected]
>From: Andrew Deason <[email protected]>
>Date: Wed, 12 Mar 2014 11:04:03 -0500
>Organization: Sine Nomine Associates
>Subject: [OpenAFS] Re: OpenAFS client cache overrun?
>
>On Wed, 12 Mar 2014 10:20:56 -0500
>Eric Chris Garrison <[email protected]> wrote:
>
>>3 - I had enabled a 2GB cache bypass, and it seemed to have no effect
>>whatsoever.
>
>"cache bypass" doesn't do anything for writes, only for read operations.
>That probably wasn't clear, but I didn't know before if this was just
>something stuffing data into afs or reading/writing stuff, or what.

Yeah, we didn't either; the user clarified. Too bad about the bypass not
working for writes.

>
>>cmdebug said this:
>>[root@rgwb1 ~]# cmdebug localhost
>>Lock afs_discon_lock status: (none_waiting, 21876 read_locks(pid:29278))
>
>To be clear, this just ran and then exited on its own, right? You didn't
>ctrl-C it or anything.

Yes, it exited on its own after a long time.

>
>>[root@rgwb1 ~]# !ps
>>ps -ef | grep 29278
>>root 29278 4477 0 09:27 ? 00:00:00 smbd
>>root 30101 29337 0 09:37 pts/3 00:00:00 grep 29278
>>When I ran "top" I saw that the afs_cachetrim process was #1, but
>>presumably wedged.
>>I goosed /proc/sysrq-trigger and as promised, it dumped a lot of call
>>trace info to the syslog. I'm looking through it, but am not sure what to
>>look for. Nothing stands out, anyway.
>
>You're looking for the stack trace for the afs_cachetrim process. Look
>in syslog for "afs_cachetrim", or its pid. Under that should be a trace
>of functions that indicates where we are in the code at that time.
>
>I would extract that, and the entry for a hanging process. So, maybe
>29278, or if anything hangs when touching anything in /afs, you could
>get the entry for that.

Oddly, there's nothing for afs_cachetrim.

Mar 13 10:16:59 rgwb1 smbd[29278]: [2014/03/13 10:16:59.762359, 1] smbd/service.c:1084(make_connection_snum)
Mar 13 10:16:59 rgwb1 smbd[29278]: XXXXX (::ffff:XXX.XXX.XXX.XXX) connect to service projects initially as user XXXXXX (uid=349570, gid=100) (pid 29278)
Mar 13 10:17:11 rgwb1 smbd[29278]: [2014/03/13 10:17:11.703003, 1] smbd/service.c:1265(close_cnum)
Mar 13 10:17:11 rgwb1 smbd[29278]: XXXXX (::ffff:XXX.XXX.XXX.XXX) closed connection to service projects
Mar 13 10:17:47 rgwb1 smbd[29278]: [2014/03/13 10:17:47.708467, 0] lib/util_sock.c:474(read_fd_with_timeout)
Mar 13 10:17:47 rgwb1 smbd[29278]: [2014/03/13 10:17:47.708545, 0] lib/util_sock.c:1441(get_peer_addr_internal)
Mar 13 10:17:47 rgwb1 smbd[29278]: getpeername failed. Error was Transport endpoint is not connected
Mar 13 10:17:47 rgwb1 smbd[29278]: read_fd_with_timeout: client 0.0.0.0 read error = Connection reset by peer.

>
>Or if you want to try to find "everything", just look for anything
>containing the string "afs".

I get just this kind of message during the last lockup:

Mar 13 10:17:32 rgwb1 kernel: afs: byte-range locks only enforced for processes on this machine (pid 15613 (smbd), user 673104).
Mar 13 10:19:32 rgwb1 kernel: afs: byte-range locks only enforced for processes on this machine (pid 15613 (smbd), user 673104).

But we get that at other times too.

>
>If you ever don't want to leave the system hanging while you examine it,
>but you want to capture information you can examine later, you can
>generate a core dump. If your system is set up to capture a core on crash
>(I'm not sure if this is the default... look at RHEL documentation, it
>should be something mentioning kdump or kexec), you can crash the system
>and you'll get a vmcore afterwards. To do this, send a 'c' to
>/proc/sysrq-trigger. That will of course crash the system and cause it
>to reboot, so don't do that if that's not what you want to happen.

Noted, will have to see what the defaults are for these systems.
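
For anyone finding this in the archives later, here is a minimal sketch of the syslog search described above (look for "afs_cachetrim", a pid, or anything containing "afs"). It is only an illustration: the helper name grep_syslog and the SAMPLE list are made up for this example, and the sample lines are copied from the excerpts in this message rather than read from a real log file.

```python
def grep_syslog(lines, needles):
    """Return the lines that contain at least one needle (case-insensitive)."""
    wanted = [n.lower() for n in needles]
    return [line for line in lines if any(n in line.lower() for n in wanted)]


# Sample lines copied from the log excerpts in this message (not a full log).
SAMPLE = [
    "Mar 13 10:17:11 rgwb1 smbd[29278]: XXXXX (::ffff:XXX.XXX.XXX.XXX) closed connection to service projects",
    "Mar 13 10:17:32 rgwb1 kernel: afs: byte-range locks only enforced for processes on this machine (pid 15613 (smbd), user 673104).",
    "Mar 13 10:17:47 rgwb1 smbd[29278]: getpeername failed. Error was Transport endpoint is not connected",
]

# Everything mentioning "afs", and everything logged by pid 29278.
afs_hits = grep_syslog(SAMPLE, ["afs"])
pid_hits = grep_syslog(SAMPLE, ["[29278]"])

# The string we were actually hoping to find; empty for this sample.
cachetrim_hits = grep_syslog(SAMPLE, ["afs_cachetrim"])
```

On a real system you would pass in the lines of the syslog file instead of SAMPLE. Which file that is depends on the syslog configuration (on RHEL it is commonly /var/log/messages, but that is an assumption here).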

Thanks,
Chris

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
