Thanks Andrew and Derrick!

We've seen the "major synchronisation error" as well when trying to provoke 
that problem.
That, together with the fact that we had the very same quorum issue about a 
year ago and that restarting the VLDB servers back then made the problem go 
away for some time, seems to indicate it's indeed the issue you mention. That 
was when we first added 1.6 servers to our cell, btw.
Apparently, we're pretty lucky ;)
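
If I understand the rollover correctly, it's just the signed 32-bit counter 
part of the ubik transaction ID wrapping once it passes 2^31 - 1. A minimal C 
sketch of the arithmetic (my own illustration, not the actual ubik code; the 
raw count is a made-up value that happens to map onto the ID udebug showed us):

-->
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    int32_t  epoch   = 1392106724;    /* first half of the trans ID, same as the
                                         db epoch in our udebug output */
    uint32_t raw     = 2402665738u;   /* hypothetical number of transactions issued */
    int32_t  counter = (int32_t)raw;  /* above 2^31 - 1, so the signed view goes
                                         negative (implementation-defined in C,
                                         but a plain two's-complement wrap on the
                                         platforms we run) */

    printf("write trans %" PRId32 ".%" PRId32 "\n", epoch, counter);
    /* prints: write trans 1392106724.-1892301558 */
    return 0;
}
<--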

I'll restart our VLDB servers …

Thanks!
 Arne


On Feb 11, 2014, at 5:48 PM, D Brashear <sha...@gmail.com>
 wrote:

> The 1.4/1.6 issue is surely a red herring. You hit the nail on the head when 
> you mentioned negative transaction IDs. There was a bugfix early in the 1.6 
> series that handled that; you probably want to just restart all your 
> dbservers so you can start counting up to rollover again, until you get to 
> the point of updating them.
> 
> 
> On Tue, Feb 11, 2014 at 4:50 AM, Arne Wiebalck <arne.wieba...@cern.ch> wrote:
> Hi,
> 
> We've recently added some 1.6.6 servers to our cell, which is mainly on 
> 1.4.15 (i.e. most of the file servers and the DB servers). We now encounter 
> quorum problems with our VLDB servers.
> 
> The primary symptom is that releases fail with "u: no quorum elected".
> 
> VLLog on the sync site shows at that moment:
> -->
> Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.51 is back up: will be 
> contacted through 137.138.246.51
> Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.50 is back up: will be 
> contacted through 137.138.246.50
> <--
> where 137.138.246.50 and 137.138.246.51 are the non-sync sites.
> 
> We can trigger this problem relatively easily by moving volumes between 1.4- 
> and 1.6-based servers (1.4/1.4 and 1.6/1.6 transfers seem OK): the move gets 
> stuck and udebug shows 
> 
> -->
> Host's addresses are: 137.138.128.148
> Host's 137.138.128.148 time is Tue Feb 11 09:38:02 2014
> Local time is Tue Feb 11 09:38:02 2014 (time differential 0 secs)
> Last yes vote for 137.138.128.148 was 5 secs ago (sync site);
> Last vote started 5 secs ago (at Tue Feb 11 09:37:57 2014)
> Local db version is 1392106724.154
> I am sync site until 54 secs from now (at Tue Feb 11 09:38:56 2014) (3 
> servers)
> Recovery state f
> I am currently managing write trans 1392106724.-1892301558
> Sync site's db version is 1392106724.154
> 0 locked pages, 0 of them for write
> There are write locks held
> Last time a new db version was labelled was:
>          1158 secs ago (at Tue Feb 11 09:18:44 2014)
> 
> Server (137.138.246.51): (db 1392106724.153)
>     last vote rcvd 20 secs ago (at Tue Feb 11 09:37:42 2014),
>     last beacon sent 20 secs ago (at Tue Feb 11 09:37:42 2014), last vote was 
> yes
>     dbcurrent=0, up=0 beaconSince=0
> 
> Server (137.138.246.50): (db 1392106724.154)
>     last vote rcvd 6 secs ago (at Tue Feb 11 09:37:56 2014),
>     last beacon sent 5 secs ago (at Tue Feb 11 09:37:57 2014), last vote was 
> yes
>     dbcurrent=1, up=1 beaconSince=1
> <--
> 
> Note that the sync site has gone to Recovery state f, and that there is quite 
> a gap between the times at which the last vote was received from the other 
> two servers, a gap that grows over time. Is the negative trans ID OK?
> 
> At some point the sync site loses its sync site state:
> 
> -->
> Host's addresses are: 137.138.128.148
> Host's 137.138.128.148 time is Tue Feb 11 09:41:01 2014
> Local time is Tue Feb 11 09:41:02 2014 (time differential 1 secs)
> Last yes vote for 137.138.128.148 was 4 secs ago (sync site);
> Last vote started 4 secs ago (at Tue Feb 11 09:40:58 2014)
> Local db version is 1392106724.206
> I am not sync site
> Lowest host 137.138.128.148 was set 4 secs ago
> Sync host 137.138.128.148 was set 4 secs ago
> I am currently managing write trans 1392106724.-1892283756
> Sync site's db version is 1392106724.206
> 0 locked pages, 0 of them for write
> There are write locks held
> Last time a new db version was labelled was:
>          1337 secs ago (at Tue Feb 11 09:18:45 2014)
> <--
> 
> so there is no sync site any longer and the vos command gets a no quorum 
> error.
> 
> As this also happens when we are not moving volumes around (e.g. at 3am), but 
> other operations such as the backup touch the volumes, I suspect that VLDB 
> operations in general can trigger this.
> 
> Is this a known issue? 
> 
> I had understood that it should be OK to run 1.4 and 1.6 file servers in 
> parallel and that the DB servers
> could be updated after the file servers, but maybe that is not correct?
> 
> Thanks!
>  Arne
> 
> 
> --
> Arne Wiebalck
> CERN IT
> 
> 
> 
> 
> -- 
> D
