[OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

Arne Wiebalck Tue, 11 Feb 2014 02:31:31 -0800

Hi,

We've recently added some 1.6.6 servers to our cell, which is mainly on 1.4.15
(i.e. most of the file servers and the DB servers). We now encounter quorum
problems with our VLDB servers.


The primary symptom is that releases fail with "u: no quorum elected".

VLLog on the sync site shows at that moment:
-->
Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.51 is back up: will be contacted through 137.138.246.51
Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.50 is back up: will be contacted through 137.138.246.50
<--
where 137.138.246.50 and 137.138.246.51 are the non-sync sites.

We can trigger this problem relatively easily by moving volumes between 1.4-
and 1.6-based servers (1.4/1.4 and 1.6/1.6 transfers seem OK): the move gets
stuck and udebug shows:

-->
Host's addresses are: 137.138.128.148
Host's 137.138.128.148 time is Tue Feb 11 09:38:02 2014
Local time is Tue Feb 11 09:38:02 2014 (time differential 0 secs)
Last yes vote for 137.138.128.148 was 5 secs ago (sync site);
Last vote started 5 secs ago (at Tue Feb 11 09:37:57 2014)
Local db version is 1392106724.154
I am sync site until 54 secs from now (at Tue Feb 11 09:38:56 2014) (3 servers)
Recovery state f
I am currently managing write trans 1392106724.-1892301558
Sync site's db version is 1392106724.154
0 locked pages, 0 of them for write
There are write locks held
Last time a new db version was labelled was:
         1158 secs ago (at Tue Feb 11 09:18:44 2014)

Server (137.138.246.51): (db 1392106724.153)
    last vote rcvd 20 secs ago (at Tue Feb 11 09:37:42 2014),
    last beacon sent 20 secs ago (at Tue Feb 11 09:37:42 2014), last vote was yes
    dbcurrent=0, up=0 beaconSince=0

Server (137.138.246.50): (db 1392106724.154)
    last vote rcvd 6 secs ago (at Tue Feb 11 09:37:56 2014),
    last beacon sent 5 secs ago (at Tue Feb 11 09:37:57 2014), last vote was yes
    dbcurrent=1, up=1 beaconSince=1
<--
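
The per-server entries in the udebug output above can also be checked
mechanically. The following is only a minimal sketch: it assumes the output
format is exactly as pasted here, and the names PEER_RE and lagging_peers are
made up for illustration (they are not part of OpenAFS). It lists the servers
that the sync site considers down or not current.

```python
import re

# Matches one per-server entry in the format shown in the udebug output above.
# The pattern is an assumption based on this paste, not on the udebug source.
PEER_RE = re.compile(
    r"Server \((?P<addr>[\d.]+)\): \(db (?P<db>[\d.]+)\)\s+"
    r"last vote rcvd (?P<vote>-?\d+) secs ago.*?"
    r"dbcurrent=(?P<dbcurrent>\d+), up=(?P<up>\d+)",
    re.DOTALL,
)


def lagging_peers(udebug_text):
    """Return (address, db version, secs since last vote) for peers
    that are flagged as not up or not current."""
    return [
        (m["addr"], m["db"], int(m["vote"]))
        for m in PEER_RE.finditer(udebug_text)
        if m["up"] == "0" or m["dbcurrent"] == "0"
    ]


SAMPLE = """\
Server (137.138.246.51): (db 1392106724.153)
    last vote rcvd 20 secs ago (at Tue Feb 11 09:37:42 2014),
    last beacon sent 20 secs ago (at Tue Feb 11 09:37:42 2014), last vote was yes
    dbcurrent=0, up=0 beaconSince=0

Server (137.138.246.50): (db 1392106724.154)
    last vote rcvd 6 secs ago (at Tue Feb 11 09:37:56 2014),
    last beacon sent 5 secs ago (at Tue Feb 11 09:37:57 2014), last vote was yes
    dbcurrent=1, up=1 beaconSince=1
"""

print(lagging_peers(SAMPLE))
# [('137.138.246.51', '1392106724.153', 20)]
```

For the output above, only 137.138.246.51 is reported: it is one db version
behind (.153 vs. .154) and is flagged up=0.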

Note that the sync site has gone to Recovery state f, and that the times at
which the last vote was received from the other two servers differ noticeably
(20 vs. 6 secs ago in this output); this gap gets larger with time. Is the
negative trans ID OK?
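
One possible reading of the negative trans ID, offered as an assumption and
not as a confirmed explanation: if the second field is a 32-bit counter that
udebug prints as a signed integer, any value above 2^31 - 1 shows up as
negative. A minimal sketch of that reinterpretation (as_unsigned32 is a
hypothetical helper name):

```python
def as_unsigned32(value):
    """Reinterpret a signed 32-bit integer as its unsigned equivalent."""
    return value & 0xFFFFFFFF


# The two trans ID counters from the udebug outputs in this mail.
for counter in (-1892301558, -1892283756):
    print(counter, "->", as_unsigned32(counter))
# -1892301558 -> 2402665738
# -1892283756 -> 2402683540
```

Under that assumption the counter is simply large, and it advanced by 17802
between the two udebug outputs.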

At some point the sync site loses its sync site state:

-->
Host's addresses are: 137.138.128.148
Host's 137.138.128.148 time is Tue Feb 11 09:41:01 2014
Local time is Tue Feb 11 09:41:02 2014 (time differential 1 secs)
Last yes vote for 137.138.128.148 was 4 secs ago (sync site);
Last vote started 4 secs ago (at Tue Feb 11 09:40:58 2014)
Local db version is 1392106724.206
I am not sync site
Lowest host 137.138.128.148 was set 4 secs ago
Sync host 137.138.128.148 was set 4 secs ago
I am currently managing write trans 1392106724.-1892283756
Sync site's db version is 1392106724.206
0 locked pages, 0 of them for write
There are write locks held
Last time a new db version was labelled was:
         1337 secs ago (at Tue Feb 11 09:18:45 2014)
<--

so there is no sync site any longer and the vos command gets a no quorum error.
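
For comparing the db versions in the two udebug outputs, here is a small
sketch. It assumes that a ubik db version has the form epoch.counter, with the
epoch being a Unix timestamp; split_db_version is a made-up helper name.

```python
from datetime import datetime, timezone


def split_db_version(version):
    """Split 'epoch.counter' into (UTC datetime of the epoch, counter)."""
    epoch, counter = version.split(".")
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc), int(counter)


first = split_db_version("1392106724.154")   # first udebug output
second = split_db_version("1392106724.206")  # second udebug output

print(first[0].isoformat())  # 2014-02-11T08:18:44+00:00
print(second[1] - first[1])  # 52
```

If the servers' local time is UTC+1, the epoch corresponds to 09:18:44 local
time, which is consistent with the "Last time a new db version was labelled"
line. Under this reading the epoch is unchanged between the two outputs and
only the counter has moved, from 154 to 206.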

This also happens when we are not moving volumes around (e.g. at 3am) but
other operations, such as the backup, touch the volumes, so I suspect that
VLDB operations in general can trigger it.

Is this a known issue? 

I had understood that it should be OK to run 1.4 and 1.6 file servers in
parallel and that the DB servers could be updated after the file servers,
but maybe that is not correct?

Thanks!
 Arne


--
Arne Wiebalck
CERN IT
