Thanks for the information. We applied the patch to our server code and are now running that on our fileservers.
I have done some more research, and have been able to trace the problem to a specific activity - doing vos dumps. Unfortunately, I don't have any debugging information, but I would like to know if anyone else has seen problems like this. On Saturday, one of our systems began to do remote vos dumps of all of our AFS data for disaster recovery purposes. It has been doing this for almost a year now, but this was the first weekend that it was doing that since I migrated our AFS servers to the new servers that I set up. Before, we had been running the 1.2.11 code (with the big ticket patch) on our fileservers. We never saw any problems during the vos dumps against those servers. It was not until we started doing the dumps against the new 1.4.4 servers that the vos problems emerged. As I mentioned, the symptoms of the problem is that all vos operations - vos examine, vos release, etc, were extremely slow. It would take several minutes for any of the commands to return, even for volumes that weren't being backed up at that time (but which were on the same server). This problem seemed to only affect the volserver - I was able to read and write files from the affected volumes with no problems. Unfortunately, I killed the remote backup job prior to installing the volserver clone patch, so I'm not sure if that would have fixed the problem. I'm hesitant to start the remote job again since I don't want to break my AFS servers. We also run a number of vos dump jobs on the new servers that dump the local volumes to the local disk every night. During the times these dumps are running, I have seen a number of messages like the following in the VolserLog: Mon Jul 30 07:04:39 2007 trans 18756 on volume 1818562471 is older than 1350 seconds Mon Jul 30 07:05:09 2007 trans 18756 on volume 1818562471 is older than 1380 seconds Mon Jul 30 07:05:39 2007 trans 18756 on volume 1818562471 is older than 1410 seconds During the remote backup, I also saw the following errors. I have not seen these during the time the local vos dumps were being performed: Sat Jul 28 19:39:46 2007 trans 1861 on volume 1818558958 has timed out Sat Jul 28 19:39:46 2007 trans 1851 on volume 537038355 has timed out Sat Jul 28 19:41:16 2007 trans 1875 on volume 536872291 has timed out Is this something new in 1.4.4? As I said, we never saw this problem with our old 1.2.11 fileservers. Thanks again to Derrick and others for help; I really appreciate it. Brian On Sat, Jul 28, 2007 at 05:18:43PM -0400, Derrick J Brashear wrote: > you probably want the volserver clone locking patch in (i'm guessing) > src/vol/clone.c since 1.4.4 > > On Sat, 28 Jul 2007, Brian Sebby wrote: > > >We're having a strange problem that just started happening this afternoon > >on one of our fileservers that appears to be related to the volserver. > > > >We have a number of jobs that perform vos release commands, and today we > >started getting error messages from them indicating that they were timing > >out, etc. Trying to run various "vos" commands takes forever, and although > >they eventually return the information, they sit there for several minutes > >before they succeed. > > > >I'm seeing a number of messages like this in the VolserLog file: > > > >Sat Jul 28 16:02:11 2007 trans 60 on volume 1818569609 has been idle for > >more than 570 seconds > >Sat Jul 28 16:02:11 2007 trans 55 on volume 1818569660 has been idle for > >more than 600 seconds > >Sat Jul 28 16:02:11 2007 trans 55 on volume 1818569660 has timed out > >Sat Jul 28 16:02:41 2007 trans 60 on volume 1818569609 has been idle for > >more than 600 seconds > >Sat Jul 28 16:02:41 2007 trans 60 on volume 1818569609 has timed out > > > >and > > > >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, > >error > >-01 > >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, > >error > >-01 > >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, > >error > >-01 > >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, > >error > >-01 > >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, > >error > >-01 > >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, > >error > >-01 > >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, > >error > >-01 > >Sat Jul 28 15:59:41 2007 1 Volser: DumpVolume: Rx call failed during dump, > >error > >-01 > > > >These volumes are on SAN storage, using ZFS as the backend fileserver. > >We're running the 1.4.4 namei fileserver on Solaris with the -nofsync > >patch. > > > >Here are the bos parameters we're using: > > > >Instance fs, (type is fs) currently running normally. > > Auxiliary status is: file server running. > > Process last started at Sat Jul 28 15:50:38 2007 (3 proc starts) > > Last exit at Sat Jul 28 15:50:38 2007 > > Command 1 is '/usr/afs/bin/fileserver -nojumbo -nofsync' > > Command 2 is '/usr/afs/bin/volserver -nojumbo -nofsync' > > Command 3 is '/usr/afs/bin/salvager' > > > >Any help would be greatly appreciated. > > > > > >Brian > > > >-- > >Brian Sebby ([EMAIL PROTECTED]) | Unix and Operation Services > >Phone: +1 630.252.9935 | Computing and Information Systems > >Fax: +1 630.252.4601 | Argonne National Laboratory > >_______________________________________________ > >OpenAFS-info mailing list > >[email protected] > >https://lists.openafs.org/mailman/listinfo/openafs-info > > > _______________________________________________ > OpenAFS-info mailing list > [email protected] > https://lists.openafs.org/mailman/listinfo/openafs-info -- Brian Sebby ([EMAIL PROTECTED]) | Unix and Operation Services Phone: +1 630.252.9935 | Computing and Information Systems Fax: +1 630.252.4601 | Argonne National Laboratory _______________________________________________ OpenAFS-info mailing list [email protected] https://lists.openafs.org/mailman/listinfo/openafs-info
