Oh FYI, when I said ' some udp traffic was affected', I didn't mean blocked. I meant mangled. Some packets were being partially truncated, so looking at the headers only didn't give me any data. This was very non obvious, until I had a hint of the problem.
Best, On Sun, Jan 12, 2014 at 4:17 PM, Timothy Balcer <[email protected]> wrote: > Wow. Just as you send it out....you find the problem. Always the way :-) > > This turned out to be a subtle network problem. There was a change in the > cross colo VPN link by our provider, and while it didn't affect the > majority of our traffic, apparently -some- udp traffic was affected, due to > some encapsulation changes. > > I discovered this when I decided to bypass the network mesh entirely, just > to 'try it', and things started sync'ing again. > > I dug deeper, and there was indeed a subtle network fault. > > So.. there you go.. if anyone else is beating their heads over a Ubik > problem like this, try bypassing parts of your network to see if there is a > network fault hiding amongst the trees. > > Best, > > > On Sun, Jan 12, 2014 at 3:54 PM, Timothy Balcer <[email protected]>wrote: > >> Hey folks. Odd problem. I have gone over many things, and I am stumped. I >> am tempted to just destroy and rebuild my errant vlserver, but I'd like to >> know what's going on and I know I am missing something :-) >> >> I have three vlservers, one in each of my colos. Lets call them A >> (10.33.10.43), B(10.36.10.7) and C(10.38.10.7) >> >> A and B sync with each other >> B and C sync with each other (or try to) but fail >> A and C do not. >> >> They do know about each other, as a udebug -long shows that. However >> between A and C I see this in the long form: >> >> Server (10.38.10.7): (db 0.0) >> last vote never rcvd >> last beacon never sent >> dbcurrent=0, up=0 beaconSince=0 >> >> Same is true for C to A. >> >> The database version is completely off between A and C. >> >> This is the complete udebug from server C, which is the one that is >> acting up: >> >> Host's addresses are: 10.38.10.7 >> Host's 10.38.10.7 time is Sun Jan 12 23:04:13 2014 >> Local time is Sun Jan 12 23:04:16 2014 (time differential 3 secs) >> Last yes vote for 10.38.10.7 was 2 secs ago (not sync site); >> Last vote started 2 secs ago (at Sun Jan 12 23:04:14 2014) >> Local db version is 1388991001.15777 >> I am not sync site >> Lowest host 10.38.10.7 was set 2 secs ago >> Sync host 0.0.0.0 was set 1389567853 secs ago >> The last trans I handled was 0.46 >> Sync site's db version is 1388991001.15777 >> 0 locked pages, 0 of them for write >> >> Server (10.36.10.7): (db 0.0) >> last vote rcvd 2 secs ago (at Sun Jan 12 23:04:14 2014), >> last beacon sent 2 secs ago (at Sun Jan 12 23:04:14 2014), last vote >> was no >> dbcurrent=0, up=1 beaconSince=1 >> >> Server (10.33.10.43): (db 0.0) >> last vote never rcvd >> last beacon never sent >> dbcurrent=0, up=0 beaconSince=0 >> >> So it thinks it is not the sync site, but 0.0.0.0 is. In reality, server >> A is the sync site. >> >> I have run tcpdump to make absolutely certain there are packets running >> between the two hosts on all AFS ports, and there are. >> >> KeyFiles are fine, and checked with md5sum. >> >> Set debug to 25 on the affected vlserver (10.38.10.7, Server C) and got >> this: >> >> Sun Jan 12 23:19:24 2014 Using 10.38.10.7 as my primary address >> Sun Jan 12 23:19:40 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:19:40 2014 Starting AFS vlserver 4 >> (/usr/lib/openafs/vlserver -rxbind -d 25) >> Sun Jan 12 23:19:40 2014 no vote from 10.36.10.7 >> @(#) OpenAFS 1.6.1-1+ubuntu0.2-debian built 2013-07-24 >> 12 23:19:40 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:19:44 2014 recovery running in state 0 >> Sun Jan 12 23:19:55 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:19:55 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:19:55 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:19:55 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:20:10 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:10 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:20:10 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:10 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:20:25 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:25 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:20:25 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:25 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:20:40 2014 ubik:server 10.33.10.43 still down >> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:40 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:40 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:20:40 2014 Ubik: vote 'yes' for 10.38.10.7 (NOT in quorum) >> Sun Jan 12 23:20:44 2014 recovery running in state 0 >> Sun Jan 12 23:20:44 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:48 2014 recovery running in state 0 >> Sun Jan 12 23:20:48 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:52 2014 recovery running in state 0 >> Sun Jan 12 23:20:52 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:55 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:55 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:20:55 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:20:55 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:20:56 2014 recovery running in state 0 >> Sun Jan 12 23:20:56 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:00 2014 recovery running in state 0 >> Sun Jan 12 23:21:00 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:04 2014 recovery running in state 0 >> Sun Jan 12 23:21:04 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 GetVolumeByID 536870913 (2) 10.38.10.83 noauth >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 GetVolumeByID 536870913 (2) 10.38.10.83 noauth >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:07 2014 allbetter checking >> Sun Jan 12 23:21:07 2014 allbetter: returning 1 >> Sun Jan 12 23:21:08 2014 recovery running in state 0 >> Sun Jan 12 23:21:08 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:10 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:10 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:21:10 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:10 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:21:12 2014 recovery running in state 0 >> Sun Jan 12 23:21:25 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:25 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:21:25 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:25 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:21:29 2014 allbetter checking >> Sun Jan 12 23:21:29 2014 allbetter: returning 1 >> Sun Jan 12 23:21:29 2014 GetVolumeByName <snipped> >> Sun Jan 12 23:21:29 2014 allbetter checking >> Sun Jan 12 23:21:29 2014 allbetter: returning 1 >> Sun Jan 12 23:21:29 2014 allbetter checking >> Sun Jan 12 23:21:29 2014 allbetter: returning 1 >> Sun Jan 12 23:21:29 2014 allbetter checking >> Sun Jan 12 23:21:29 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>* >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>* >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>* >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:30 2014 allbetter checking >> Sun Jan 12 23:21:30 2014 allbetter: returning 1 >> Sun Jan 12 23:21:40 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:40 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:21:40 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:40 2014 Received beacon type 0 from host 10.38.10.7 >> Sun Jan 12 23:21:55 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:55 2014 no vote from 10.36.10.7 >> Sun Jan 12 23:21:55 2014 beacon: amSyncSite is 0 >> Sun Jan 12 23:21:55 2014 Received beacon type 0 from host 10.38.10.7 >> >> Any help is appreciated! I have already upgraded and rebooted all three >> servers, lowest to highest, to no avail. I also attempted moving the data >> files out of the way and starting the affected vlserver, but the same >> symptoms remain. >> >> -- >> Timothy Balcer / IT Services >> Telmate / San Francisco, CA >> Direct / (415) 300-4313 >> Customer Service / (800) 205-5510 >> > > > > -- > Timothy Balcer / IT Services > Telmate / San Francisco, CA > Direct / (415) 300-4313 > Customer Service / (800) 205-5510 > -- Timothy Balcer / IT Services Telmate / San Francisco, CA Direct / (415) 300-4313 Customer Service / (800) 205-5510
