Oh FYI, when I said ' some udp traffic was affected', I didn't mean
blocked. I meant mangled. Some packets were being partially truncated, so
looking at the headers only didn't give me any data. This was very non
obvious, until I had a hint of the problem.

Best,


On Sun, Jan 12, 2014 at 4:17 PM, Timothy Balcer <[email protected]> wrote:

> Wow. Just as you send it out....you find the problem. Always the way :-)
>
> This turned out to be a subtle network problem. There was a change in the
> cross colo VPN link by our provider, and while it didn't affect the
> majority of our traffic, apparently -some- udp traffic was affected, due to
> some encapsulation changes.
>
> I discovered this when I decided to bypass the network mesh entirely, just
> to 'try it', and things started sync'ing again.
>
> I dug deeper, and there was indeed a subtle network fault.
>
> So.. there you go.. if anyone else is beating their heads over a Ubik
> problem like this, try bypassing parts of your network to see if there is a
> network fault hiding amongst the trees.
>
> Best,
>
>
> On Sun, Jan 12, 2014 at 3:54 PM, Timothy Balcer <[email protected]>wrote:
>
>> Hey folks. Odd problem. I have gone over many things, and I am stumped. I
>> am tempted to just destroy and rebuild my errant vlserver, but I'd like to
>> know what's going on and I know I am missing something :-)
>>
>> I have three vlservers, one in each of my colos. Lets call them A
>> (10.33.10.43), B(10.36.10.7) and C(10.38.10.7)
>>
>> A and B sync with each other
>> B and C sync with each other (or try to) but fail
>> A and C do not.
>>
>> They do know about each other, as a udebug -long shows that. However
>> between A and C I see this in the long form:
>>
>> Server (10.38.10.7): (db 0.0)
>>     last vote never rcvd
>>     last beacon never sent
>>     dbcurrent=0, up=0 beaconSince=0
>>
>> Same is true for C to A.
>>
>> The database version is completely off between A and C.
>>
>> This is the complete udebug from server C, which is the one that is
>> acting up:
>>
>> Host's addresses are: 10.38.10.7
>> Host's 10.38.10.7 time is Sun Jan 12 23:04:13 2014
>> Local time is Sun Jan 12 23:04:16 2014 (time differential 3 secs)
>> Last yes vote for 10.38.10.7 was 2 secs ago (not sync site);
>> Last vote started 2 secs ago (at Sun Jan 12 23:04:14 2014)
>> Local db version is 1388991001.15777
>> I am not sync site
>> Lowest host 10.38.10.7 was set 2 secs ago
>> Sync host 0.0.0.0 was set 1389567853 secs ago
>> The last trans I handled was 0.46
>> Sync site's db version is 1388991001.15777
>> 0 locked pages, 0 of them for write
>>
>> Server (10.36.10.7): (db 0.0)
>>     last vote rcvd 2 secs ago (at Sun Jan 12 23:04:14 2014),
>>     last beacon sent 2 secs ago (at Sun Jan 12 23:04:14 2014), last vote
>> was no
>>     dbcurrent=0, up=1 beaconSince=1
>>
>> Server (10.33.10.43): (db 0.0)
>>     last vote never rcvd
>>     last beacon never sent
>>     dbcurrent=0, up=0 beaconSince=0
>>
>> So it thinks it is not the sync site, but 0.0.0.0 is. In reality, server
>> A is the sync site.
>>
>> I have run tcpdump to make absolutely certain there are packets running
>> between the two hosts on all AFS ports, and there are.
>>
>> KeyFiles are fine, and checked with md5sum.
>>
>> Set debug to 25 on the affected vlserver (10.38.10.7, Server C) and got
>> this:
>>
>> Sun Jan 12 23:19:24 2014 Using 10.38.10.7 as my primary address
>> Sun Jan 12 23:19:40 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:19:40 2014 Starting AFS vlserver 4
>> (/usr/lib/openafs/vlserver -rxbind -d 25)
>> Sun Jan 12 23:19:40 2014 no vote from 10.36.10.7
>> @(#) OpenAFS 1.6.1-1+ubuntu0.2-debian built  2013-07-24
>> 12 23:19:40 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:19:44 2014 recovery running in state 0
>> Sun Jan 12 23:19:55 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:19:55 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:19:55 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:19:55 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:20:10 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:10 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:20:10 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:10 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:20:25 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:25 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:20:25 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:25 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:20:40 2014 ubik:server 10.33.10.43 still down
>> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:40 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:20:40 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:40 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:20:40 2014 Ubik: vote 'yes' for 10.38.10.7 (NOT in quorum)
>> Sun Jan 12 23:20:44 2014 recovery running in state 0
>> Sun Jan 12 23:20:44 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:48 2014 recovery running in state 0
>> Sun Jan 12 23:20:48 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:52 2014 recovery running in state 0
>> Sun Jan 12 23:20:52 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:55 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:55 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:20:55 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:20:55 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:20:56 2014 recovery running in state 0
>> Sun Jan 12 23:20:56 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:00 2014 recovery running in state 0
>> Sun Jan 12 23:21:00 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:04 2014 recovery running in state 0
>> Sun Jan 12 23:21:04 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 GetVolumeByID 536870913 (2) 10.38.10.83 noauth
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 GetVolumeByID 536870913 (2) 10.38.10.83 noauth
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:07 2014 allbetter checking
>> Sun Jan 12 23:21:07 2014 allbetter: returning 1
>> Sun Jan 12 23:21:08 2014 recovery running in state 0
>> Sun Jan 12 23:21:08 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:10 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:10 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:21:10 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:10 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:21:12 2014 recovery running in state 0
>> Sun Jan 12 23:21:25 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:25 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:21:25 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:25 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:21:29 2014 allbetter checking
>> Sun Jan 12 23:21:29 2014 allbetter: returning 1
>> Sun Jan 12 23:21:29 2014 GetVolumeByName <snipped>
>> Sun Jan 12 23:21:29 2014 allbetter checking
>> Sun Jan 12 23:21:29 2014 allbetter: returning 1
>> Sun Jan 12 23:21:29 2014 allbetter checking
>> Sun Jan 12 23:21:29 2014 allbetter: returning 1
>> Sun Jan 12 23:21:29 2014 allbetter checking
>> Sun Jan 12 23:21:29 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>*
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>*
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 GetVolumeByName *<snipped>*
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:30 2014 allbetter checking
>> Sun Jan 12 23:21:30 2014 allbetter: returning 1
>> Sun Jan 12 23:21:40 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:40 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:21:40 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:40 2014 Received beacon type 0 from host 10.38.10.7
>> Sun Jan 12 23:21:55 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:55 2014 no vote from 10.36.10.7
>> Sun Jan 12 23:21:55 2014 beacon: amSyncSite is 0
>> Sun Jan 12 23:21:55 2014 Received beacon type 0 from host 10.38.10.7
>>
>> Any help is appreciated! I have already upgraded and rebooted all three
>> servers, lowest to highest, to no avail. I also attempted moving the data
>> files out of the way and starting the affected vlserver, but the same
>> symptoms remain.
>>
>> --
>> Timothy Balcer / IT Services
>> Telmate / San Francisco, CA
>> Direct / (415) 300-4313
>> Customer Service / (800) 205-5510
>>
>
>
>
> --
> Timothy Balcer / IT Services
> Telmate / San Francisco, CA
> Direct / (415) 300-4313
> Customer Service / (800) 205-5510
>



-- 
Timothy Balcer / IT Services
Telmate / San Francisco, CA
Direct / (415) 300-4313
Customer Service / (800) 205-5510

Reply via email to