[OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

2014-02-11 Thread Arne Wiebalck
Hi, We've recently added some 1.6.6 servers into our cell which is mainly on 1.4.15 (i.e. most of the file servers and the DB servers). We now encounter quorum problems with our VLDB servers. The primary symptom is that releases fail with u: no quorum elected. VLLog on the sync site shows at

[OpenAFS] Re: DB quorum problems on 1.4/1.6 mixed cell

2014-02-11 Thread Andrew Deason
On Tue, 11 Feb 2014 09:50:35 + Arne Wiebalck arne.wieba...@cern.ch wrote: I am currently managing write trans 1392106724.-1892301558 [...] Note that the sync site has gone to Recovery state f and that the time at which the last vote was received on the other two servers has quite a time

Re: [OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

2014-02-11 Thread D Brashear
The 1.4/1.6 issue is surely a red herring. You hit the nail when you mentioned negative transaction IDs. There was a bugfix early in the 1.6 series which handled that; you probably want to just restart all your dbservers so you can start counting up to rollover again, until you get to the point of
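
To illustrate the negative transaction ID discussed above, here is a minimal sketch. It is not OpenAFS code. It assumes the counter half of the ID quoted in the thread (1392106724.-1892301558) is a 32-bit value that the log prints as a signed integer. The helper names `as_unsigned32` and `as_signed32` are hypothetical.

```python
import struct

def as_unsigned32(value):
    """Return the low 32 bits of value as an unsigned integer."""
    return value & 0xFFFFFFFF

def as_signed32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]

# Counter part of the transaction ID quoted in the thread.
counter_from_log = -1892301558

# Under the 32-bit assumption, the same bit pattern read as unsigned.
unsigned_counter = as_unsigned32(counter_from_log)
print(unsigned_counter)

# A negative printed value means the counter has passed 2**31 - 1.
print(unsigned_counter > 2**31 - 1)

# The wrap itself: one step past the largest positive signed value.
print(as_signed32(2**31 - 1), as_signed32(2**31))
```

Under that assumption, a negative counter just means the value has grown past the signed 32-bit maximum, which fits the advice that restarting the dbservers starts the count towards rollover again.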

Re: [OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

2014-02-11 Thread Arne Wiebalck
Thanks Andrew and Derrick! We've seen the major synchronisation error as well when trying to provoke that problem. This, and the fact that we had the very same quorum issue about one year ago, when restarting the VLDB servers made the problem go away for some time, seem to indicate it's indeed

[OpenAFS] Re: Openafs kernel problems

2014-02-11 Thread Andrew Deason
On Tue, 4 Feb 2014 09:38:27 -0500 Dave Botsch bot...@cnf.cornell.edu wrote: kernel: 2.6.32-431.3.1.el6.x86_64 Just a little bit more information about this. Apparently the problematic code was also introduced in the RHEL 6.4 kernel series (2.6.32-358*), but was quickly pulled out.