I received a reply from Alejandro suggesting that I check live_router_check_interval, dead_router_check_interval and router_ping_timeout. I had those set to the defaults, which I assume are 60, 60, and 50 seconds respectively. I did just try setting those values explicitly, and I'm not seeing any better behavior.

From watching /proc/sys/lnet/routers on the client, I see that the client is indeed sending router pings every 60 seconds.

On the router itself, watching /proc/sys/lnet/peers immediately after doing 'lctl net down; lctl net up', I see the 'last' column for my test client count from 0 up to around 180, at which point the client is marked 'down'. (For the other peers, all of which are servers, the values count from 0 to around 180 and then reset to 0, remaining 'up'.)

Is the 'last' column reflecting the last time the router received a 'ping' from that peer? If so, why do the numbers count to 180 instead of 60, which is the interval at which the pings are being sent?
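To track how 'last' changes for a single peer over time, something like the following could be run on the router. This is only a minimal sketch: it assumes the peers file begins with a header row naming the 'nid', 'state' and 'last' columns, the helper names (parse_peers, peer_status, watch) are hypothetical, and the sample table is made up for illustration (only the client NID comes from the error message quoted below).

```python
import time

# Illustrative sample only: the server NID and all numeric values are invented.
SAMPLE_PEERS = """\
nid                      refs state  last   max   rtr   min    tx   min queue
10.10.104.201@tcp2          1  down   181     8     8     8     8     6 0
192.168.64.10@o2ib1         1    up    12     8     8     8     8     6 0
"""


def parse_peers(text):
    """Split a peers table into dicts keyed by the header's column names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Later duplicate column names (e.g. "min") overwrite earlier ones;
    # only nid, state and last are used here, so that is harmless.
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def peer_status(text, nid):
    """Return (state, last) for the given NID, or None if it is not listed."""
    for row in parse_peers(text):
        if row.get("nid") == nid:
            return row.get("state"), int(row["last"])
    return None


def watch(nid, path="/proc/sys/lnet/peers", interval=5):
    """Print the peer's state and 'last' value every `interval` seconds."""
    while True:
        with open(path) as f:
            status = peer_status(f.read(), nid)
        print(time.strftime("%H:%M:%S"), nid, status)
        time.sleep(interval)
```

Calling watch("10.10.104.201@tcp2") right after 'lctl net down; lctl net up' would log the moment the state flips to 'down' alongside the 'last' value, which makes it easier to compare against the 60-second ping interval.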
Thanks,
Kevin

On Mon, Oct 30, 2017 at 8:47 AM, Kevin M. Hildebrand <[email protected]> wrote:

> Hello, I'm trying to set up some new Lustre routers between a set of
> Infiniband connected Lustre servers and a few hosts connected to an
> external 100G Ethernet network. The problem I'm having is that the
> routers work just fine for a minute or two, and then shortly thereafter
> they're marked as 'down' and all traffic stops. If I unload/reload the
> lustre modules on the router, it'll work again for a short time and then
> stop again. The router shows errors like:
> [236528.801275] LNetError: 54389:0:(lib-move.c:2120:lnet_parse_get())
> 10.10.104.2@tcp2: Unable to send REPLY for GET from
> 12345-10.10.104.201@tcp2: -113
>
> My Lustre router has a Mellanox ConnectX-3 interface connecting to the
> Lustre servers, and a Mellanox ConnectX-5 100G interface connecting to
> a 100G switch to which my test client is connected.
>
> On the Infiniband side, I've got lnet configured as o2ib1, and on the
> Ethernet side, as tcp2.
>
> Clients and servers are all running Lustre 2.8. The Lustre router at the
> moment is running Lustre 2.10.1, because of software dependencies to
> support the 100G card.
>
> I've verified that I have stable network connectivity on both the IB and
> Ethernet sides.
>
> At the moment, I have very simple lnet configurations, using the built in
> defaults.
> lnet.conf on the server:
> options lnet ip2nets="o2ib1(ib0) 192.168.[64-95].*; tcp1
> 10.103.[128-159].*" routes="tcp0 192.168.64.[78-79]@o2ib1; tcp2
> 192.168.64.[78-79]@o2ib1"
>
> On the lustre router:
> options lnet networks="o2ib1(ib0),tcp2(p1p1.104)" "forwarding=enabled"
>
> And on the client:
> options lnet networks="tcp2(p4p1.104)" routes="o2ib1 10.10.104.[2-3]@tcp2"
>
> All of the hosts (client, server, router) have the following in
> ko2iblnd.conf:
>
> alias ko2iblnd-opa ko2iblnd
> options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024
> concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048
> fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
>
> install ko2iblnd /usr/sbin/ko2iblnd-probe
>
> Does anyone see anything I've missed, or have any thoughts on where I
> should look next?
>
> Thanks,
> Kevin
>
> --
> Kevin Hildebrand
> University of Maryland, College Park
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
