Hello,

I agree with the others who have stated that we would not expect the loss of a router to 
significantly affect the I/O destined for filesystems served by other routers, 
nor would we expect the I/O destined for non-routed filesystems to be affected. 
However, I can say that we have seen bugs in this area in the past where the 
loss of a remote filesystem (the servers, not the routers serving that 
filesystem) did affect access to other filesystems. If I recall correctly the 
issue was that resources were being consumed on the routers in trying to 
communicate with the lost filesystem. That resource consumption caused I/O 
destined for other filesystems to get backed up. I’m not aware of any 
outstanding issues like this, and I’ll stress that that sort of behavior would 
certainly be considered a bug. So please let us know if you see any issues.

Regarding check_routers_before_use, this parameter affects how the LNet router 
checker behaves upon startup. The router checker on an LNet peer works by 
periodically sending an LNet ping to each known router. If a peer receives a 
response from the router within a timeout period then the router is considered 
alive, otherwise it is considered dead and routes hosted by that router are 
removed from the routing table (until it starts responding to the pings). By 
default, all routers are initially considered to be up (alive), and all routes 
are immediately eligible for sends. When check_routers_before_use is enabled 
(set to “1”) all routers are instead initially considered down (dead), and each 
router must first respond to an LNet-level ping before its routes become 
eligible for sends.
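
To make the startup difference concrete, here is a rough Python sketch of the behavior described above. It is only a toy model, not LNet's implementation: the RouterChecker class, its method names, and the example NIDs are all made up for illustration, and the timeout handling is reduced to a boolean "did the router reply in time".

```python
from dataclasses import dataclass


@dataclass
class Router:
    nid: str
    alive: bool


class RouterChecker:
    """Toy model of the router checker's view of router state (hypothetical)."""

    def __init__(self, router_nids, check_routers_before_use=False):
        # Default: every router starts out alive, so its routes are usable
        # immediately. With check_routers_before_use enabled, every router
        # starts out dead until it answers a ping.
        initial_state = not check_routers_before_use
        self.routers = {nid: Router(nid, initial_state) for nid in router_nids}

    def record_ping(self, nid, responded_in_time):
        # A reply within the timeout marks the router alive; a missed reply
        # marks it dead until a later ping succeeds.
        self.routers[nid].alive = responded_in_time

    def usable_routers(self):
        # Only routes through alive routers are eligible for sends.
        return sorted(nid for nid, r in self.routers.items() if r.alive)


nids = ["10.0.0.1@o2ib", "10.0.0.2@o2ib"]  # made-up example NIDs
default_checker = RouterChecker(nids)
strict_checker = RouterChecker(nids, check_routers_before_use=True)

# One router answers its first ping in the strict configuration.
strict_checker.record_ping("10.0.0.1@o2ib", responded_in_time=True)
```

In this model default_checker can use both routers right away, while strict_checker can only use the one router that has answered a ping so far.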

The use of this parameter should not affect the scenarios you describe. Traffic 
destined for local networks is not affected by the up or down (alive or dead) 
states of routers.
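
As a simplified illustration of why local traffic is unaffected, the send decision can be thought of roughly as in the sketch below. Again this is an assumption-laden toy model rather than LNet code: pick_next_hop, the network names, and the gateway names are hypothetical.

```python
def pick_next_hop(dest_net, local_nets, routes, router_alive):
    """Return "direct" for a local network, a live gateway for a routed
    network, or None if no live gateway is available (toy model)."""
    # A destination on a directly attached network never consults router state.
    if dest_net in local_nets:
        return "direct"
    # A routed destination needs at least one gateway currently marked alive.
    for gateway in routes.get(dest_net, []):
        if router_alive.get(gateway, False):
            return gateway
    return None


local_nets = {"tcp0"}                  # e.g. an ethernet client
routes = {"o2ib0": ["gw1", "gw2"]}     # remote network reached through routers
all_down = {"gw1": False, "gw2": False}
one_up = {"gw1": False, "gw2": True}

local_result = pick_next_hop("tcp0", local_nets, routes, all_down)
routed_result_down = pick_next_hop("o2ib0", local_nets, routes, all_down)
routed_result_up = pick_next_hop("o2ib0", local_nets, routes, one_up)
```

With every router down, the local destination is still sent directly, and only the routed destination is left without a usable path.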

Chris Horn

From: lustre-discuss <[email protected]> on behalf of 
Makia Minich <[email protected]>
Date: Wednesday, May 9, 2018 at 8:51 AM
To: "[email protected]" <[email protected]>
Subject: [lustre-discuss] LNET Routing Question

Hello all,

I have an LNET routing question. I’ve attached a quick diagram of the current 
setup; but basically I have two core networks (one infiniband and one ethernet) 
with a set of LNET routers in between. There are storage and clients on both 
sides of these routers and all clients need to see all/most storage. All 
connections, configurations, etc. are working.

The question is, if an LNET router goes down (which does cause some amount of 
reconnect or remapping for any clients attempting to use those routes) would 
this cause any issues or delays for a client’s connection to non-routed 
storage? Put slightly differently, if a job on the ethernet clients is actively 
using ethernet storage and the lnet routers go down, will the job be affected? What 
about a new job just launching when that lnet router is down?

In addition, what does “check_routers_before_use” actually do and does it 
change the scenarios I mentioned? (e.g. If an ethernet client has 
“check_routers_before_use” would every file request start with a ping to the 
routers even if it’s not leaving its core network?)

Thanks!

—

Makia Minich
Principal Architect
System Fabric Works
"Fabric Computing that Works"

"Oh, I don't know. I think everything is just as it should be, y'know?"
- Frank Fairfield

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
