Glad to hear you got things figured out!

-Ben

On Wed, Jan 09, 2019 at 02:26:19AM +0000, Ximeng (Simon) Guan wrote:
> Thanks. Yes, the bos invocation did hang for a minute or two before 
> reporting that failure. 
> 
> We just figured out the reason for the failure. It is still MTU-related:
> 
> 1. Between offices we use IPsec for VPN, and that limits the path MTU to 
> 1400. 
> 2. To accommodate the reduced MTU we did the following:
>     2.1 Apply -rxmaxmtu 1400 in BosConfig
>     2.2 Adjust the ifcfg-xxx config in the host machine of the failed 
> database server to be 1400. 
> 
> It turns out that 2.2 caused the problem. The database machine is hosted as 
> a KVM VM. After we set the MTU in the host's ifcfg to 1400 and the power 
> outage rebooted the server, the server started to drop incoming 1500-byte 
> UDP packets. 
> 
> The server and office laptops are connected through an L2 switch that does 
> not handle fragmentation. All remote traffic goes through an L3 router, which 
> does, and re-packs packets to 1400 bytes. That's why all the local clients 
> had problems accessing AFS but the remote servers and clients did not... 
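
For anyone reading this in the archive, the arithmetic behind that failure can
be sketched in a few lines of Python. This is only an illustration, assuming
plain IPv4 with a 20-byte header (no IP options) and the 8-byte UDP header; the
function names are hypothetical and nothing here is part of OpenAFS.

```python
# Sketch: how much UDP payload fits in one unfragmented IPv4 packet,
# and whether a packet of a given size fits a link with a given MTU.
# Assumptions: IPv4 header without options (20 bytes), UDP header (8 bytes).

IPV4_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in a single IPv4 packet of size `mtu`."""
    return mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


def fits_without_fragmentation(ip_packet_bytes: int, mtu: int) -> bool:
    """True if an IP packet of this total size can cross the link as-is."""
    return ip_packet_bytes <= mtu


if __name__ == "__main__":
    # MTU values mentioned in this thread.
    for mtu in (1500, 1400, 1344):
        print(f"MTU {mtu}: max UDP payload {max_udp_payload(mtu)} bytes")

    # A full-size packet from a laptop still using MTU 1500 does not fit
    # an interface that was lowered to 1400.
    print(fits_without_fragmentation(1500, 1400))
```

A local client sending at 1500 through a switch that cannot fragment exceeds
the 1400 limit on the reconfigured interface, while traffic that a router has
already reduced to 1400 fits.
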
> 
> Thank you!
> 
> Simon
> 
> -----Original Message-----
> From: Benjamin Kaduk <[email protected]> 
> Sent: Tuesday, January 8, 2019 6:13 PM
> To: Ximeng (Simon) Guan <[email protected]>
> Cc: [email protected]
> Subject: Re: [OpenAFS] Client connection failure: bos failed to contact 
> host's bosserver (communication failure (-1))
> 
> On Mon, Jan 07, 2019 at 08:00:27PM +0000, Ximeng (Simon) Guan wrote:
> > We do have NetInfo properly set up to include the only one IP that is used. 
> 
> Good to know, thanks.
> 
> I couldn't rule out MTU issues offhand, but don't have time to dig in further 
> right now.  
> 
> Do the problematic bos invocations hang for a minute or two before reporting 
> the "communications failure"?
> 
> The bosserver listens on port 7007, if you hadn't found that already -- a 
> packet capture would help show what's going on, if you have the ability to 
> get one of those.
> 
> -Ben
> 
> > Can the connection failure somehow come from the non-default MTU settings 
> > we are using? That has bitten us repeatedly in the past in different places. 
> > We use "-rxmaxmtu 1344" across the board for all ptserver, vlserver, 
> > davolserver and dafileserver instances. I was told by the network folks 
> > that they cannot manage the default MTU of 1500 but have to use 1400 
> > because of the IPsec requirement...
> > 
> > Thank you!
> > Simon
> > 
> > -----Original Message-----
> > From: [email protected] <[email protected]> 
> > On Behalf Of Benjamin Kaduk
> > Sent: Monday, January 7, 2019 11:44 AM
> > To: Ximeng (Simon) Guan <[email protected]>
> > Cc: [email protected]
> > Subject: Re: [OpenAFS] Client connection failure: bos failed to 
> > contact host's bosserver (communication failure (-1))
> > 
> > On Mon, Jan 07, 2019 at 07:40:36PM +0000, Ximeng (Simon) Guan wrote:
> > > Hello,
> > > 
> > > After a power outage on Christmas Eve which forced two database servers 
> > > and all the network switches in one of our offices to re-boot, our laptop 
> > > clients in that office can no longer connect to one of the AFS servers 
> > > hosted in the same office.
> > > 
> > > I am leaning towards the possibility that it is a network problem instead 
> > > of an OpenAFS service problem because:
> > > 
> > >   1.  Remote offices can access the full AFS space, including those 
> > > volumes hosted on the re-booted servers.
> > >   2.  Between the servers there is no access problem. Nothing wrong with 
> > > the result of "bos status", "rxdebug" or "udebug". "fs checkservers" shows 
> > > that all servers are running.
> > >   3.  On the problematic laptops "fs checkservers" shows that "All servers 
> > > are running".
> > >   4.  On the problematic laptops "bos status afssrv1" returns a message:
> > > 
> > > "bos: failed to contact host's bosserver (communications failure (-1))."
> > > 
> > > But on the servers both in that office and in the remote offices, the 
> > > same command shows that all services are up:
> > > 
> > > "Instance ptserver, currently running normally.
> > > Instance vlserver, currently running normally.
> > > Instance buserver, currently running normally.
> > > Instance upserver, currently running normally.
> > > Instance backupusers, currently running normally.
> > >     Auxiliary status is: run next at Tue Jan  8 04:00:00 2019.
> > > Instance dafs, currently running normally.
> > >     Auxiliary status is: file server running."
> > > 
> > >   5.  On the problematic laptops "rxdebug afssrv1 -port 7000" returns 
> > > *normal* output, for example:
> > > 
> > > "Trying 10.12.8.33 (port 7000):
> > > Free packets: 2073/6357, packet reclaims: 3, calls: 81, used FDs: 36
> > > not waiting for packets.
> > > 0 calls waiting for a thread
> > > 125 threads are idle
> > > 1 calls have waited for a thread
> > > Connection from host 10.9.119.50, port 7001, Cuid ae06e5b3/70fe0104
> > >   serial 12,  natMTU 1344, security index 0, client conn
> > >     call 0: # 4, state dally, mode: receiving, flags: receive_done
> > >     call 1: # 0, state not initialized
> > >     call 2: # 0, state not initialized
> > >     call 3: # 0, state not initialized
> > > Connection from host 10.12.4.74, port 7001, Cuid ae06e5b3/70fe0114
> > >   serial 21,  natMTU 1344, security index 0, client conn
> > >     call 0: # 7, state dally, mode: receiving, flags: receive_done
> > >     call 1: # 0, state not initialized
> > >     call 2: # 0, state not initialized
> > >     call 3: # 0, state not initialized
> > > Done."
> > > 
> > > I do not administer the network. Can I have some advice on how to further 
> > > debug the connection problem? Which UDP port does the command "bos 
> > > status" use?
> > 
> > My instinct would be that there is some multihoming going on and that 
> > http://docs.openafs.org/Reference/5/NetRestrict.html and/or 
> > http://docs.openafs.org/Reference/5/NetInfo.html are not properly 
> > configured.
> > 
> > -Ben
> > _______________________________________________
> > OpenAFS-info mailing list
> > [email protected]
> > https://lists.openafs.org/mailman/listinfo/openafs-info