Small patch to make sure the salvager and the volserver cannot take the same volume offline from the fileserver in parallel:


--- /afs/ipp/.cs/openafs/openafs-1.4.7-osd/src/vol/fssync.c 2008-05-06 14:42:29.000000000 +0200
+++ ./fssync.c  2008-06-05 14:13:52.000000000 +0200
@@ -591,8 +591,19 @@
     case FSYNC_OFF:
     case FSYNC_NEEDVOLUME:{
            leaveonline = 0;
- /* not already offline, we need to find a slot for newly offline volume */
            if (!v) {
+               /* not already offline by this handler */
+               /* Check that no other handler has it offline */
+               int found = 0;
+               for (i = 0; i < MAXOFFLINEVOLUMES; i++) {
+                   if (volumes[i].volumeID == volume)
+                       found = 1;
+               }
+               if (found) {
+                   rc = FSYNC_DENIED;
+                   break;
+               }
+               /* Find a slot for newly offline volume */
                for (i = 0; i < MAXOFFLINEVOLUMES; i++) {
                    if (volumes[i].volumeID == 0) {
                        v = &volumes[i];
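
For illustration, this is roughly how a caller such as the salvager would see the new denial over FSSYNC. This is only a minimal sketch against the 1.4-era FSYNC_askfs() interface and is not part of the patch; the FSYNC_SALVAGE reason code and the header set are assumptions:

/* Sketch only, not part of the patch above: a salvager-side caller
 * backing off when the fileserver answers FSYNC_DENIED because another
 * handler (e.g. the volserver) already has the volume offline.
 * Assumes the 1.4-era src/vol interface (volume.h, fssync.h); the
 * FSYNC_SALVAGE reason code is an assumption. */
#include <stdio.h>
#include "volume.h"
#include "fssync.h"

static int
TryTakeVolumeOffline(VolumeId volume, char *partition)
{
    int code = FSYNC_askfs(volume, partition, FSYNC_OFF, FSYNC_SALVAGE);

    if (code == FSYNC_DENIED) {
        /* Already offline for someone else: skip the volume rather than
         * salvaging data that is being modified in parallel. */
        fprintf(stderr, "volume %lu busy elsewhere, not salvaging\n",
                (unsigned long) volume);
        return 1;
    }
    return 0;
}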


Rainer Toebbicke wrote:
Sorry, I started this in openafs-info a few days ago, but here's probably the better place:

the problem: "salvager -part xxx -vol NNN" corrupts volumes
(BTW, even with -nowrite!)

the scenario:

1. the volserver takes a volume offline, and the salvager takes the same volume offline without knowing about the volserver. This is possible because the partition lock only protects VAttachVolume, and the "offline" request succeeds (of course) even if the volume is already offline.

2. the salvager and the volserver work on the same data; first opportunity for a thorough mess.

3. assuming they did not (e.g. salvager -nowrite), the salvager finishes and puts the volume back online while the volserver is still running. There is no "reference count" on offline/online. The fileserver sees the volume come back, users jump onto it, and we have a second opportunity to blow everything apart.


I see a number of alternatives:

1. never use "salvager -vol"; always shut down the volserver beforehand (a challenge with the bosserver restarting it), or try gymnastics with "vos lock", "vos offline", etc. It all looks pretty cumbersome.

2. salvage a single volume within the volserver, triggered by a "vos salvage" command; probably the most logical, analogous to "vos zap" et al. I haven't checked yet whether the code can be shared without problems; it looks like some non-trivial Makefile and #ifdef gymnastics to get right. And of course it changes the documented commands, with changes in bos and vos.

3. have the salvager create a transaction in the volserver, which then takes care of offline/online. This would have to watch out for deadlocks, as the salvager doesn't just salvage the volume itself but also the RW parent in case of an RO or BK, and it would have to handle the (usual) case that the volume is so damaged that it cannot be attached. But otherwise it is a minimal intrusion, with no external changes (a rough sketch follows below).
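
A rough sketch of how alternative 3 could look, assuming the standard volser RPCs AFSVolTransCreate/AFSVolEndTrans and the ITOffline flag; setting up the rx connection to the local volserver and linking the salvager against the volser client stubs are left out (that is exactly the Makefile gymnastics mentioned above):

/* Rough sketch of alternative 3 only: the salvager takes the volume
 * offline via an ITOffline volserver transaction, so the volserver's
 * own bookkeeping serializes access.  Connection setup, error paths
 * and the deadlock / unattachable-volume issues mentioned above are
 * not handled; header names are assumptions. */
#include "volser.h"    /* ITOffline */
#include "volint.h"    /* AFSVolTransCreate / AFSVolEndTrans stubs */

static afs_int32
SalvageUnderTransaction(struct rx_connection *tconn, afs_uint32 volid,
                        afs_int32 partition)
{
    afs_int32 trans = 0, rcode = 0, code;

    /* Ask the volserver to take the volume offline for us. */
    code = AFSVolTransCreate(tconn, volid, partition, ITOffline, &trans);
    if (code)
        return code;    /* busy, locked or unattachable: do not salvage */

    /* ... run the actual salvage of the volume here ... */

    /* End the transaction; the volserver puts the volume back online. */
    code = AFSVolEndTrans(tconn, trans, &rcode);
    return code ? code : rcode;
}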


Before I go and implement 2. or 3., please give me your thoughts on the scenario if you care (I may have missed something and would have to look for another culprit - Derrick hinted at "-orphans attach", which I have not fully understood yet), and on whether 2. is preferable to 3.


Cheers, Rainer




--
-----------------------------------------------------------------
Hartmut Reuter                  e-mail          [EMAIL PROTECTED]
                                phone            +49-89-3299-1328
                                fax              +49-89-3299-1301
RZG (Rechenzentrum Garching)    web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------