Wow. Just as you send it out....you find the problem. Always the way :-) This turned out to be a subtle network problem. There was a change in the cross colo VPN link by our provider, and while it didn't affect the majority of our traffic, apparently -some- udp traffic was affected, due to some encapsulation changes.
I discovered this when I decided to bypass the network mesh entirely, just to 'try it', and things started sync'ing again. I dug deeper, and there was indeed a subtle network fault. So.. there you go.. if anyone else is beating their heads over a Ubik problem like this, try bypassing parts of your network to see if there is a network fault hiding amongst the trees. Best, On Sun, Jan 12, 2014 at 3:54 PM, Timothy Balcer <[email protected]> wrote: > Hey folks. Odd problem. I have gone over many things, and I am stumped. I > am tempted to just destroy and rebuild my errant vlserver, but I'd like to > know what's going on and I know I am missing something :-) > > I have three vlservers, one in each of my colos. Lets call them A > (10.33.10.43), B(10.36.10.7) and C(10.38.10.7) > > A and B sync with each other > B and C sync with each other (or try to) but fail > A and C do not. > > They do know about each other, as a udebug -long shows that. However > between A and C I see this in the long form: > > Server (10.38.10.7): (db 0.0) > last vote never rcvd > last beacon never sent > dbcurrent=0, up=0 beaconSince=0 > > Same is true for C to A. > > The database version is completely off between A and C. > > This is the complete udebug from server C, which is the one that is acting > up: > > Host's addresses are: 10.38.10.7 > Host's 10.38.10.7 time is Sun Jan 12 23:04:13 2014 > Local time is Sun Jan 12 23:04:16 2014 (time differential 3 secs) > Last yes vote for 10.38.10.7 was 2 secs ago (not sync site); > Last vote started 2 secs ago (at Sun Jan 12 23:04:14 2014) > Local db version is 1388991001.15777 > I am not sync site > Lowest host 10.38.10.7 was set 2 secs ago > Sync host 0.0.0.0 was set 1389567853 secs ago > The last trans I handled was 0.46 > Sync site's db version is 1388991001.15777 > 0 locked pages, 0 of them for write > > Server (10.36.10.7): (db 0.0) > last vote rcvd 2 secs ago (at Sun Jan 12 23:04:14 2014), > last beacon sent 2 secs ago (at Sun Jan 12 23:04:14 2014), last vote > was no > dbcurrent=0, up=1 beaconSince=1 > > Server (10.33.10.43): (db 0.0) > last vote never rcvd > last beacon never sent > dbcurrent=0, up=0 beaconSince=0 > > So it thinks it is not the sync site, but 0.0.0.0 is. In reality, server A > is the sync site. > > I have run tcpdump to make absolutely certain there are packets running > between the two hosts on all AFS ports, and there are. > > KeyFiles are fine, and checked with md5sum. > > Set debug to 25 on the affected vlserver (10.38.10.7, Server C) and got > this: > > Sun Jan 12 23:19:24 2014 Using 10.38.10.7 as my primary address > Sun Jan 12 23:19:40 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:19:40 2014 Starting AFS vlserver 4 > (/usr/lib/openafs/vlserver -rxbind -d 25) > Sun Jan 12 23:19:40 2014 no vote from 10.36.10.7 > @(#) OpenAFS 1.6.1-1+ubuntu0.2-debian built 2013-07-24 > 12 23:19:40 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:19:44 2014 recovery running in state 0 > Sun Jan 12 23:19:55 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:19:55 2014 no vote from 10.36.10.7 > Sun Jan 12 23:19:55 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:19:55 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:20:10 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:10 2014 no vote from 10.36.10.7 > Sun Jan 12 23:20:10 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:10 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:20:25 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:25 2014 no vote from 10.36.10.7 > Sun Jan 12 23:20:25 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:25 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:20:40 2014 ubik:server 10.33.10.43 still down > Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:40 2014 no vote from 10.36.10.7 > Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:40 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:20:40 2014 Ubik: vote 'yes' for 10.38.10.7 (NOT in quorum) > Sun Jan 12 23:20:44 2014 recovery running in state 0 > Sun Jan 12 23:20:44 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:48 2014 recovery running in state 0 > Sun Jan 12 23:20:48 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:52 2014 recovery running in state 0 > Sun Jan 12 23:20:52 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:55 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:55 2014 no vote from 10.36.10.7 > Sun Jan 12 23:20:55 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:20:55 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:20:56 2014 recovery running in state 0 > Sun Jan 12 23:20:56 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:00 2014 recovery running in state 0 > Sun Jan 12 23:21:00 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:04 2014 recovery running in state 0 > Sun Jan 12 23:21:04 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 GetVolumeByID 536870913 (2) 10.38.10.83 noauth > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 GetVolumeByID 536870913 (2) 10.38.10.83 noauth > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:07 2014 allbetter checking > Sun Jan 12 23:21:07 2014 allbetter: returning 1 > Sun Jan 12 23:21:08 2014 recovery running in state 0 > Sun Jan 12 23:21:08 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:10 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:10 2014 no vote from 10.36.10.7 > Sun Jan 12 23:21:10 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:10 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:21:12 2014 recovery running in state 0 > Sun Jan 12 23:21:25 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:25 2014 no vote from 10.36.10.7 > Sun Jan 12 23:21:25 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:25 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:21:29 2014 allbetter checking > Sun Jan 12 23:21:29 2014 allbetter: returning 1 > Sun Jan 12 23:21:29 2014 GetVolumeByName <snipped> > Sun Jan 12 23:21:29 2014 allbetter checking > Sun Jan 12 23:21:29 2014 allbetter: returning 1 > Sun Jan 12 23:21:29 2014 allbetter checking > Sun Jan 12 23:21:29 2014 allbetter: returning 1 > Sun Jan 12 23:21:29 2014 allbetter checking > Sun Jan 12 23:21:29 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>* > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>* > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>* > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:30 2014 allbetter checking > Sun Jan 12 23:21:30 2014 allbetter: returning 1 > Sun Jan 12 23:21:40 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:40 2014 no vote from 10.36.10.7 > Sun Jan 12 23:21:40 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:40 2014 Received beacon type 0 from host 10.38.10.7 > Sun Jan 12 23:21:55 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:55 2014 no vote from 10.36.10.7 > Sun Jan 12 23:21:55 2014 beacon: amSyncSite is 0 > Sun Jan 12 23:21:55 2014 Received beacon type 0 from host 10.38.10.7 > > Any help is appreciated! I have already upgraded and rebooted all three > servers, lowest to highest, to no avail. I also attempted moving the data > files out of the way and starting the affected vlserver, but the same > symptoms remain. > > -- > Timothy Balcer / IT Services > Telmate / San Francisco, CA > Direct / (415) 300-4313 > Customer Service / (800) 205-5510 > -- Timothy Balcer / IT Services Telmate / San Francisco, CA Direct / (415) 300-4313 Customer Service / (800) 205-5510
