On Fri, 09 Mar 2012 15:46:03 +0100 ProbaNet <[email protected]> wrote:
> > If you want to quickly test this, you can run udebug against each of
> > the dbservers from each of the other dbservers. If beacons are
> > getting through but not database updates, though, maybe there could
> > be an issue where only packets over a certain size aren't getting
> > through or something, hmm...
>
> Yes, I fear so.. We did all the udebug tests (from all servers to all
> servers) and everything was ok.

Everything you've said so far makes it sound like _something_ is getting
dropped along the way, due to the rather broad nature of the
problems/timeouts. And I take it the network layout you have for this is
not exactly the simplest thing, is that correct? There's some firewalls
and VPNs and maybe NATs getting traversed; is that right?

My guess would be fragmented packets are getting lost, or maybe just
packets over a certain size. So, let's look at a couple of things.

First, some stats. Let's get the output from
'rxdebug <server> 7003 -rxstat -noconn', for at least afsmn1, and
perhaps all of the other vlservers while we're at it.

Also, just to get data from the variety of ways this seems to manifest,
get the same data for the volserver 'vos release'-ish hangs. Run
'rxdebug <server> 7005 -rxstat -noconn' for afsrm1 and afsmn5 (the two
servers involved in the hanging 'vos release' example you mentioned
before).

In addition to all that, you can run a more 'real' test of connectivity
between the servers with rxperf. If you don't have rxperf, you can build
it from source by building the tree, and then going into src/rx (1.4.x)
or src/rx/test (1.6.x and beyond, I think) and running 'make rxperf'.

On one server, run 'rxperf server -p 12345'. On the other, run:

rxperf client -c send -b 1048576 -T 5 -p 12345 -s <otherserver>
rxperf client -c recv -b 1048576 -T 5 -p 12345 -s <otherserver>

and see if you can get it to hang. If it succeeds, it should print out
some stats and exit after transferring 5M.
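
To make "hang" unambiguous, here is a minimal sketch that puts a
deadline around each client run. It assumes GNU coreutils 'timeout' is
installed. The run_with_deadline helper, the PEER variable, and the
60-second limit are placeholders of mine, not anything rxperf provides.

```shell
#!/bin/sh
# Sketch only: wrap a command in a deadline so a hang shows up as a result.
# Assumes GNU coreutils `timeout`; helper name and limits are illustrative.

run_with_deadline() {
    # $1 = deadline in seconds, remaining arguments = command to run
    secs=$1
    shift
    timeout "$secs" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # GNU timeout exits with 124 when it had to kill the command
        echo "HUNG after ${secs}s: $*"
    elif [ "$rc" -ne 0 ]; then
        echo "FAILED (exit $rc): $*"
    else
        echo "OK: $*"
    fi
    return "$rc"
}

# Only attempt the real test when rxperf is on the PATH and PEER is set.
# PEER is a placeholder for the host running 'rxperf server -p 12345'.
if command -v rxperf >/dev/null 2>&1 && [ -n "${PEER:-}" ]; then
    for dir in send recv; do
        run_with_deadline 60 rxperf client -c "$dir" -b 1048576 -T 5 \
            -p 12345 -s "$PEER"
    done
fi
```

Saved under a name of your choosing, it could be run as
'PEER=<otherserver> sh <script>'. A line starting with HUNG is the case
worth capturing traffic for.
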
If you want it to transfer more and run a bit longer, just up the -T
parameter. If you can get it to hang, capture a network dump for UDP
port 12345 on each server. If you cannot get it to hang, we can try
looking at a network capture for the 'real' communication of vlserver or
volserver, etc, when those hang, but it seems better to get it to happen
with a test scenario first, if possible.

> > You mention above there are at least _some_ messages in the log.
> > What are they?
>
> All messages were kind of 'remote server X voted "yes" on date Y',
> 'Received beacon=1 from server X', etc etc.. Nothing strange (debug
> level 125).

Ah okay, at debug level 125, sure. I thought you meant you were seeing
log messages at log level 0, which I would be much more suspicious about
:)

-- 
Andrew Deason
[email protected]

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
