On Tue, Mar 24, 2009 at 07:15:46PM -0400, Jason Edgecombe wrote:
> david l goodrich wrote:
>> On Tue, Mar 24, 2009 at 10:39:24AM -0700, Russ Allbery wrote:
>>
>>> david l goodrich <[email protected]> writes:
>>>
>>>> The past two nights, I've had one of my AFS fileserver go "down"
>>>>
>>>> I say "down" and not down because it's not totally nonfunctional.
>>>>
>>>> It thinks it's running fine:
>>>>
>>>> sprawl# bos status localhost -localauth
>>>> Instance fs, currently running normally.
>>>>     Auxiliary status is: file server running.
>>>>
>>> bos status -long is generally more useful.  However:
>>>
>> Can do:
>> sprawl# bos status localhost -localauth -long
>> Instance fs, (type is fs) currently running normally.
>>     Auxiliary status is: file server running.
>>     Process last started at Mon Mar 23 17:33:57 2009 (3 proc starts)
>>     Last exit at Mon Mar 23 17:33:57 2009
>>     Command 1 is '/usr/pkg/libexec/openafs/fileserver'
>>     Command 2 is '/usr/pkg/libexec/openafs/volserver'
>>     Command 3 is '/usr/pkg/libexec/openafs/salvager'
>>
>> sprawl# ps auxw | grep /openafs/
>> root  376 0.0 0.0 2316    4 ?     DW    5:33PM 0:00.83 /usr/pkg/libexec/openafs/volserver
>> root  727 0.0 0.0 8664 2384 ?     IW<a  5:33PM 0:18.29 /usr/pkg/libexec/openafs/fileserver
>> root 6739 0.0 0.0  240    4 ttyp0 R+   12:42PM 0:00.00 grep /openafs/ (ksh)
>> sprawl#
>>
>>>> but none of the clients (running 1.4.8 and 1.4.6) are able to
>>>> connect to the volumes on the server, despite believing that
>>>> d...@chaos:~$ fs checkservers -fast -all
>>>> All servers are running.
>>>> d...@chaos:~$ vos listvol sprawl
>>>> Could not fetch the list of partitions from the server
>>>> Possible communication failure
>>>> Error in vos listvol command.
>>>> Possible communication failure
>>>>
>>> I suspect your volserver either died or went unresponsive.  What version
>>> of OpenAFS is the fileserver?  Is there anything incriminating in
>>> VolserLog or FileLog?
>>>
>>
>> I should have been more clear - sprawl is the fileserver, it is
>> running 1.4.6.  There doesn't seem to be anything incriminating
>> in FileLog, but let me turn up debugging on the volserver process
>> on sprawl.
>>
>> Turning on debugging (pkill -TSTP volserver) didn't do much of
>> anything - VolserLog hasn't been updated since 17:34 yesterday.
>>
>> It's short:
>> sprawl# cat VolserLog
>> Mon Mar 23 17:33:57 2009 Unable to connect to file server; will retry at need
>> Mon Mar 23 17:33:57 2009 Starting AFS Volserver 2.0 (/usr/pkg/libexec/openafs/volserver)
>> sprawl#
>>
> Did you run kill -TSTP volserver and fileserver 5 times each?  That turns
> on the maximum amount of debugging.

I think four. I'll go do a fifth after I send this.

The server has spontaneously recovered (seriously, there's nothing
in the logs) and /vicepa is now accessible locally.

I suspect that some weird hardware glitch, combined with a bug in
1.4.6 that Derrick mentioned, is the cause of this, but I am going
to leave debugging turned on and see what happens overnight. Yes, I
will post to the list with details.

Thanks everyone, this has been a real learning experience for me.

--david
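
P.S. For anyone following along, here is a rough sketch of the "TSTP five times each" step Jason describes, built on the same `pkill -TSTP` call used earlier in the thread. The function name `raise_debug` and the `DRY_RUN` variable are made up for illustration and are not part of OpenAFS; the one-second pause between signals is also just an assumption to give each process time to handle one signal before the next arrives.

```shell
#!/bin/sh
# Sketch only: send SIGTSTP to a named process a given number of times.
# raise_debug and DRY_RUN are hypothetical names, not OpenAFS tooling.
raise_debug() {
    proc=$1            # process name to signal, e.g. volserver
    count=${2:-5}      # number of signals; the thread suggests 5
    i=1
    while [ "$i" -le "$count" ]; do
        if [ -n "${DRY_RUN:-}" ]; then
            # Dry run: print the command instead of sending a signal.
            echo "pkill -TSTP $proc"
        else
            pkill -TSTP "$proc"
            sleep 1    # assumed pause between signals
        fi
        i=$((i + 1))
    done
}

# Example (run as root on the fileserver):
#   raise_debug volserver 5
#   raise_debug fileserver 5
```

Setting `DRY_RUN=1` first prints the commands without signalling anything, which is a cheap way to check the loop before running it against a live server.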
