Wow. Just as you send it out....you find the problem. Always the way :-)

This turned out to be a subtle network problem. There was a change in the
cross colo VPN link by our provider, and while it didn't affect the
majority of our traffic, apparently -some- udp traffic was affected, due to
some encapsulation changes.

I discovered this when I decided to bypass the network mesh entirely, just
to 'try it', and things started sync'ing again.

I dug deeper, and there was indeed a subtle network fault.

So.. there you go.. if anyone else is beating their heads over a Ubik
problem like this, try bypassing parts of your network to see if there is a
network fault hiding amongst the trees.

Best,


On Sun, Jan 12, 2014 at 3:54 PM, Timothy Balcer <[email protected]> wrote:

> Hey folks. Odd problem. I have gone over many things, and I am stumped. I
> am tempted to just destroy and rebuild my errant vlserver, but I'd like to
> know what's going on and I know I am missing something :-)
>
> I have three vlservers, one in each of my colos. Lets call them A
> (10.33.10.43), B(10.36.10.7) and C(10.38.10.7)
>
> A and B sync with each other
> B and C sync with each other (or try to) but fail
> A and C do not.
>
> They do know about each other, as a udebug -long shows that. However
> between A and C I see this in the long form:
>
> Server (10.38.10.7): (db 0.0)
>     last vote never rcvd
>     last beacon never sent
>     dbcurrent=0, up=0 beaconSince=0
>
> Same is true for C to A.
>
> The database version is completely off between A and C.
>
> This is the complete udebug from server C, which is the one that is acting
> up:
>
> Host's addresses are: 10.38.10.7
> Host's 10.38.10.7 time is Sun Jan 12 23:04:13 2014
> Local time is Sun Jan 12 23:04:16 2014 (time differential 3 secs)
> Last yes vote for 10.38.10.7 was 2 secs ago (not sync site);
> Last vote started 2 secs ago (at Sun Jan 12 23:04:14 2014)
> Local db version is 1388991001.15777
> I am not sync site
> Lowest host 10.38.10.7 was set 2 secs ago
> Sync host 0.0.0.0 was set 1389567853 secs ago
> The last trans I handled was 0.46
> Sync site's db version is 1388991001.15777
> 0 locked pages, 0 of them for write
>
> Server (10.36.10.7): (db 0.0)
>     last vote rcvd 2 secs ago (at Sun Jan 12 23:04:14 2014),
>     last beacon sent 2 secs ago (at Sun Jan 12 23:04:14 2014), last vote
> was no
>     dbcurrent=0, up=1 beaconSince=1
>
> Server (10.33.10.43): (db 0.0)
>     last vote never rcvd
>     last beacon never sent
>     dbcurrent=0, up=0 beaconSince=0
>
> So it thinks it is not the sync site, but 0.0.0.0 is. In reality, server A
> is the sync site.
>
> I have run tcpdump to make absolutely certain there are packets running
> between the two hosts on all AFS ports, and there are.
>
> KeyFiles are fine, and checked with md5sum.
>
> Set debug to 25 on the affected vlserver (10.38.10.7, Server C) and got
> this:
>
> Sun Jan 12 23:19:24 2014 Using 10.38.10.7 as my primary address
> Sun Jan 12 23:19:40 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:19:40 2014 Starting AFS vlserver 4
> (/usr/lib/openafs/vlserver -rxbind -d 25)
> Sun Jan 12 23:19:40 2014 no vote from 10.36.10.7
> @(#) OpenAFS 1.6.1-1+ubuntu0.2-debian built  2013-07-24
> 12 23:19:40 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:19:44 2014 recovery running in state 0
> Sun Jan 12 23:19:55 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:19:55 2014 no vote from 10.36.10.7
> Sun Jan 12 23:19:55 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:19:55 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:20:10 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:10 2014 no vote from 10.36.10.7
> Sun Jan 12 23:20:10 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:10 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:20:25 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:25 2014 no vote from 10.36.10.7
> Sun Jan 12 23:20:25 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:25 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:20:40 2014 ubik:server 10.33.10.43 still down
> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:40 2014 no vote from 10.36.10.7
> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:40 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:20:40 2014 Ubik: vote 'yes' for 10.38.10.7 (NOT in quorum)
> Sun Jan 12 23:20:44 2014 recovery running in state 0
> Sun Jan 12 23:20:44 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:48 2014 recovery running in state 0
> Sun Jan 12 23:20:48 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:52 2014 recovery running in state 0
> Sun Jan 12 23:20:52 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:55 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:55 2014 no vote from 10.36.10.7
> Sun Jan 12 23:20:55 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:20:55 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:20:56 2014 recovery running in state 0
> Sun Jan 12 23:20:56 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:00 2014 recovery running in state 0
> Sun Jan 12 23:21:00 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:04 2014 recovery running in state 0
> Sun Jan 12 23:21:04 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 GetVolumeByID 536870913 (2) 10.38.10.83 noauth
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 GetVolumeByID 536870913 (2) 10.38.10.83 noauth
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:07 2014 allbetter checking
> Sun Jan 12 23:21:07 2014 allbetter: returning 1
> Sun Jan 12 23:21:08 2014 recovery running in state 0
> Sun Jan 12 23:21:08 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:10 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:10 2014 no vote from 10.36.10.7
> Sun Jan 12 23:21:10 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:10 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:21:12 2014 recovery running in state 0
> Sun Jan 12 23:21:25 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:25 2014 no vote from 10.36.10.7
> Sun Jan 12 23:21:25 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:25 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:21:29 2014 allbetter checking
> Sun Jan 12 23:21:29 2014 allbetter: returning 1
> Sun Jan 12 23:21:29 2014 GetVolumeByName <snipped>
> Sun Jan 12 23:21:29 2014 allbetter checking
> Sun Jan 12 23:21:29 2014 allbetter: returning 1
> Sun Jan 12 23:21:29 2014 allbetter checking
> Sun Jan 12 23:21:29 2014 allbetter: returning 1
> Sun Jan 12 23:21:29 2014 allbetter checking
> Sun Jan 12 23:21:29 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>*
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>*
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>*
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:30 2014 allbetter checking
> Sun Jan 12 23:21:30 2014 allbetter: returning 1
> Sun Jan 12 23:21:40 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:40 2014 no vote from 10.36.10.7
> Sun Jan 12 23:21:40 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:40 2014 Received beacon type 0 from host 10.38.10.7
> Sun Jan 12 23:21:55 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:55 2014 no vote from 10.36.10.7
> Sun Jan 12 23:21:55 2014 beacon: amSyncSite is 0
> Sun Jan 12 23:21:55 2014 Received beacon type 0 from host 10.38.10.7
>
> Any help is appreciated! I have already upgraded and rebooted all three
> servers, lowest to highest, to no avail. I also attempted moving the data
> files out of the way and starting the affected vlserver, but the same
> symptoms remain.
>
> --
> Timothy Balcer / IT Services
> Telmate / San Francisco, CA
> Direct / (415) 300-4313
> Customer Service / (800) 205-5510
>



-- 
Timothy Balcer / IT Services
Telmate / San Francisco, CA
Direct / (415) 300-4313
Customer Service / (800) 205-5510

Reply via email to