On Tue, 06/03/2012 at 10.50 -0600, Andrew Deason wrote:
> On Tue, 06 Mar 2012 12:12:52 +0100
> ProbaNet <[email protected]> wrote:
>
> > We have a locked volume and we are unable to unlock it (it's an
> > important one, it stores a big list of dirs / mountpoints for other
> > volumes). It has been locked during a release operation which is now
> > aborted due to a failure. We tried:
> > - vos unlock vol -verbose [CTRL+C after 20+ minutes]
> > - vos unlockvldb -verbose [CTRL+C after 20+ minutes]
> > Both commands failed (they wait forever with no output). The logs seems
> > to be "ok" (no errors in VLLog / VolserLog / etc..). We have the quorum
>
> You get absolutely no output? Not even "Binding to the VLDB server" ?

Oops, you're right: with 'unlockvldb' there was that line of output,
"Binding to the VLDB server", but then nothing else (also in the logs,
only messages about elections).

> What happens if you try with -localauth or -noauth ?

Same results with -localauth, not tested with -noauth.

> What version and platform is this? Can you run 'pstack <vos pid>' when
> it's stuck?

Too late for the pstack (problem solved for now).. :)
Dbservers are debian lenny + backports (afsrm1), debian squeeze (afsmn1,
afsmn3, afsor1) and gentoo (afsmn2). All x86_64 with openafs 1.4.x :

for s in afsmn1 afsmn2 afsmn3 afsrm1 afsor1; do rxdebug $s 7003 -version |grep AFS; done
AFS version: OpenAFS 1.4.12.1 built 2011-02-09
AFS version: OpenAFS 1.4.14 built 2011-01-31
AFS version: OpenAFS 1.4.12.1 built 2011-02-09
AFS version: OpenAFS 1.4.12.1 built 2011-02-22
AFS version: OpenAFS 1.4.12.1 built 2011-02-09

After a while we found the real problem with "udebug afsmn1 vlserver".
Quorum OK (all servers vote yes for afsmn1), but a different db version
for server afsrm1 (dbcurrent=0, up=1 beaconSince=1). Recovery state "f".
No propagation was triggered, and we don't understand why.

In order to quickly solve the problem we did the following (from afsmn1):
- bos stop afsrm1 vlserver
- scp /var/lib/openafs/db/vldb.DB0 afsrm1:/var/lib/openafs/db/
- bos restart afsmn1 vlserver [waited until quorum OK]
- bos start afsrm1 vlserver

Now "udebug afsmn1 vlserver" is perfect (Recovery state 1f, dbcurrent=1
for all, same db version for all), we could unlock the volume (and the
vlserver) and we could create / remove volumes and perform normal
operations.

But the problem is not solved. To test the situation we tried:
- bos stop afsrm1 vlserver
- vos create afsmn1 a test_vol [slow, but worked]
- udebug afsmn1 vlserver (db version increased in all servers but
  afsrm1, as expected)
- bos start afsrm1 vlserver
- udebug afsmn1 vlserver

At this point we expected a db propagation to afsrm1.
But nothing happened in 1 hour: nothing in the logs, dbcurrent=0,
different db versions and the vldb frozen again (solved again with the
scp method described above).

Any suggestion? :)
Thank you very much for your help!

Stefano Fabio

P.S.: we are planning to turn afsrm1 and afsor1 (currently regular voting
dbservers) into non-voting clone servers: is that a simple task? Any
suggestion on how to do that? Thanks again!

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
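
For reference, the two recovery states reported above ("f" while the
database was stuck, "1f" once it was healthy) can be decoded bit by bit.
The sketch below is only an illustration: the helper name
decode_recovery_state is hypothetical, and the flag meanings are my
reading of the Ubik recovery flags as described in the udebug
documentation, so check them against the man page for your OpenAFS
version before relying on them.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenAFS): decode the hex "Recovery state"
# value printed by udebug. Flag meanings are an assumption based on the
# udebug documentation; verify them for your OpenAFS version.
decode_recovery_state() {
    state=$((0x$1))    # udebug prints the state in hex without a 0x prefix
    for entry in \
        "1:this server is the sync site" \
        "2:sync site has found the best database version" \
        "4:sync site has a copy of the best database" \
        "8:database has been relabeled" \
        "16:database has been sent to all servers"
    do
        bit=${entry%%:*}     # numeric flag value before the first colon
        text=${entry#*:}     # description after the first colon
        if [ $((state & bit)) -ne 0 ]; then
            echo "set:     $text"
        else
            echo "missing: $text"
        fi
    done
}

decode_recovery_state f     # the state seen while afsrm1 was out of date
decode_recovery_state 1f    # the state seen after the manual scp fix
```

Under that reading, state "f" differs from "1f" only in the last flag,
which matches the symptom described: the sync site considered its own
database good but had not distributed it to every server.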
