Re: [OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

2014-02-13 Thread Arne Wiebalck
Just to confirm: I restarted all VL servers in our cell, the transaction IDs
were reset and so far I wasn't able to reproduce the problem. So it seems that
it was indeed the negative transaction ID problem in 1.4 VL servers mentioned
earlier in this thread.

Cheers,
 Arne 







Re: [OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

2014-02-13 Thread Michael Garrison
At UMich we ran into the same issue a while ago, and I wound up backporting
the patch to 1.4. Since then, we haven't had the negative transaction ID
issue crop up again.

--
Mike Garrison


[OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

2014-02-11 Thread Arne Wiebalck
Hi,

We've recently added some 1.6.6 servers to our cell, which is mainly on 1.4.15
(i.e. most of the file servers and the DB servers). We now encounter quorum
problems with our VLDB servers.

The primary symptom is that releases fail with u: no quorum elected.

VLLog on the sync site shows at that moment:
--
Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.51 is back up: will be contacted through 137.138.246.51
Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.50 is back up: will be contacted through 137.138.246.50
--
where 137.138.246.50 and 137.138.246.51 are the non-sync sites.

We can trigger this problem relatively easily by moving volumes between 1.4-
and 1.6-based servers (1.4/1.4 and 1.6/1.6 transfers seem OK): the move gets
stuck and udebug shows

--
Host's addresses are: 137.138.128.148
Host's 137.138.128.148 time is Tue Feb 11 09:38:02 2014
Local time is Tue Feb 11 09:38:02 2014 (time differential 0 secs)
Last yes vote for 137.138.128.148 was 5 secs ago (sync site);
Last vote started 5 secs ago (at Tue Feb 11 09:37:57 2014)
Local db version is 1392106724.154
I am sync site until 54 secs from now (at Tue Feb 11 09:38:56 2014) (3 servers)
Recovery state f
I am currently managing write trans 1392106724.-1892301558
Sync site's db version is 1392106724.154
0 locked pages, 0 of them for write
There are write locks held
Last time a new db version was labelled was:
 1158 secs ago (at Tue Feb 11 09:18:44 2014)

Server (137.138.246.51): (db 1392106724.153)
last vote rcvd 20 secs ago (at Tue Feb 11 09:37:42 2014),
last beacon sent 20 secs ago (at Tue Feb 11 09:37:42 2014), last vote was yes
dbcurrent=0, up=0 beaconSince=0

Server (137.138.246.50): (db 1392106724.154)
last vote rcvd 6 secs ago (at Tue Feb 11 09:37:56 2014),
last beacon sent 5 secs ago (at Tue Feb 11 09:37:57 2014), last vote was yes
dbcurrent=1, up=1 beaconSince=1
--

Note that the sync site has gone to Recovery state f and that there is quite
a gap between the times at which the last vote was received from the other
two servers; this gap grows over time. Is the negative trans ID ok?
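
The per-server part of the udebug report above can also be checked by a
script. The following is only a sketch: parse_servers and the regular
expressions are hypothetical helpers written against the output format shown
in this message (with wrapped lines rejoined), not part of OpenAFS, and other
udebug versions may format these lines differently.

```python
import re

# Hypothetical helper, not part of OpenAFS: summarise the per-server
# blocks of a udebug report in the format quoted in this thread.
SERVER_RE = re.compile(
    r"Server \((?P<ip>[\d.]+)\): \(db (?P<epoch>\d+)\.(?P<counter>-?\d+)\)")
VOTE_RE = re.compile(r"last vote rcvd (?P<secs>-?\d+) secs ago")
FLAGS_RE = re.compile(r"dbcurrent=(?P<dbcurrent>\d), up=(?P<up>\d)")

SAMPLE = """\
Server (137.138.246.51): (db 1392106724.153)
last vote rcvd 20 secs ago (at Tue Feb 11 09:37:42 2014),
last beacon sent 20 secs ago (at Tue Feb 11 09:37:42 2014), last vote was yes
dbcurrent=0, up=0 beaconSince=0

Server (137.138.246.50): (db 1392106724.154)
last vote rcvd 6 secs ago (at Tue Feb 11 09:37:56 2014),
last beacon sent 5 secs ago (at Tue Feb 11 09:37:57 2014), last vote was yes
dbcurrent=1, up=1 beaconSince=1
"""


def parse_servers(text):
    """Return one dict per 'Server (...)' block found in text."""
    servers = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = SERVER_RE.search(line)
        if match:
            current = {
                "ip": match["ip"],
                "db": (int(match["epoch"]), int(match["counter"])),
            }
            servers.append(current)
            continue
        if current is None:
            continue  # still in the header part of the report
        match = VOTE_RE.search(line)
        if match:
            current["vote_age"] = int(match["secs"])
        match = FLAGS_RE.search(line)
        if match:
            current["dbcurrent"] = match["dbcurrent"] == "1"
            current["up"] = match["up"] == "1"
    return servers


servers = parse_servers(SAMPLE)
# Servers the sync site considers down or not up to date.
lagging = [s["ip"] for s in servers if not (s["up"] and s["dbcurrent"])]
# Gap between the oldest and the newest "last vote rcvd" age, in seconds.
vote_gap = max(s["vote_age"] for s in servers) - min(s["vote_age"] for s in servers)
print(lagging, vote_gap)
```

Run against the report above, this lists 137.138.246.51 as lagging and gives
a vote gap of 14 seconds.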

At some point the sync site loses its sync site state:

--
Host's addresses are: 137.138.128.148
Host's 137.138.128.148 time is Tue Feb 11 09:41:01 2014
Local time is Tue Feb 11 09:41:02 2014 (time differential 1 secs)
Last yes vote for 137.138.128.148 was 4 secs ago (sync site);
Last vote started 4 secs ago (at Tue Feb 11 09:40:58 2014)
Local db version is 1392106724.206
I am not sync site
Lowest host 137.138.128.148 was set 4 secs ago
Sync host 137.138.128.148 was set 4 secs ago
I am currently managing write trans 1392106724.-1892283756
Sync site's db version is 1392106724.206
0 locked pages, 0 of them for write
There are write locks held
Last time a new db version was labelled was:
 1337 secs ago (at Tue Feb 11 09:18:45 2014)
--

so there is no sync site any longer and the vos command gets a no quorum error.
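
Both udebug reports above show a negative second component in the write
transaction ID. A minimal sketch for spotting this automatically follows;
write_trans, has_negative_counter and epoch_utc are hypothetical names, and
the regular expression is written against the line format shown in this
message only. Reading the first component as a Unix timestamp is an
assumption, although it matches the "new db version was labelled" time above
if the servers run on UTC+1.

```python
import re
from datetime import datetime, timezone

# Hypothetical helpers, not part of OpenAFS.
WRITE_TRANS_RE = re.compile(
    r"managing write trans (?P<epoch>\d+)\.(?P<counter>-?\d+)")


def write_trans(text):
    """Return (epoch, counter) from a udebug report, or None if absent."""
    match = WRITE_TRANS_RE.search(text)
    if match is None:
        return None
    return int(match["epoch"]), int(match["counter"])


def has_negative_counter(text):
    """True if the report shows a write transaction with a negative counter."""
    trans = write_trans(text)
    return trans is not None and trans[1] < 0


def epoch_utc(epoch):
    """Interpret the first component as seconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


line = "I am currently managing write trans 1392106724.-1892301558"
print(write_trans(line), has_negative_counter(line))
print(epoch_utc(1392106724).isoformat())  # 2014-02-11T08:18:44+00:00
```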

As this also happens when we are not moving volumes around (e.g. at 3am), but
while other operations such as the backup touch the volumes, I suspect that
VLDB operations in general can trigger this.

Is this a known issue? 

I had understood that it should be OK to run 1.4 and 1.6 file servers in
parallel and that the DB servers could be updated after the file servers, but
maybe that is not correct?

Thanks!
 Arne


--
Arne Wiebalck
CERN IT





Re: [OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

2014-02-11 Thread D Brashear
The 1.4/1.6 issue is surely a red herring. You hit the nail on the head when
you mentioned negative transaction IDs. There was a bugfix early in the 1.6
series which handled that; you probably want to just restart all your
dbservers so you can start counting up to rollover again, until you get to
the point of updating them.
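
To illustrate what "counting up to rollover" means, here is a small sketch of
two's-complement wraparound. It assumes the transaction counter behaves like
a signed 32-bit integer, which the negative value in the udebug output
suggests but which is not confirmed in this thread. to_int32 is a
hypothetical helper, not OpenAFS code.

```python
INT32_MAX = 2**31 - 1


def to_int32(value):
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


# One increment past the largest positive value wraps to the most negative one.
print(to_int32(INT32_MAX + 1))  # -2147483648

# The counter reported by udebug, read back as an unsigned 32-bit value.
observed = -1892301558
as_unsigned = observed & 0xFFFFFFFF
print(as_unsigned)              # 2402665738
print(as_unsigned - INT32_MAX)  # 255182091 increments past the rollover point
```

Under that assumption, a restart resets the counter, so it takes a long time
before it crosses INT32_MAX and turns negative again.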





-- 
D


Re: [OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

2014-02-11 Thread Arne Wiebalck
Thanks Andrew and Derrick!

We've seen the major synchronisation error as well when trying to provoke
that problem. This, and the fact that we had the very same quorum issue about
one year ago, when restarting the VLDB servers made the problem go away for
some time, seem to indicate that it's indeed the issue you mention. That was
when we first added 1.6 servers to our cell, btw.
Apparently, we're pretty lucky ;)

I'll restart our VLDB servers …

Thanks!
 Arne




