Thanks, I completely missed that. The ko2iblnd parameters were indeed different between the servers and the router. I've updated the parameters on the router to match those on the servers, but things haven't gotten any better. (The problem appears to be on the Ethernet side anyway, so you've probably helped me fix a problem I didn't know I had...) I don't see much discussion about configuring LNet parameters for Ethernet networks; I assume that side uses ksocklnd. There, all of the ksocklnd parameters appear to match between the router and the clients. Interestingly, peer_timeout is 180, which is almost exactly the point at which my client gets marked down on the router.

Server (and now router) ko2iblnd parameters:
peer_credits 8
peer_credits_hiw 4
credits 256
concurrent_sends 8
ntx 512
map_on_demand 0
fmr_pool_size 512
fmr_flush_trigger 384
fmr_cache 1

Client and router ksocklnd:
peer_timeout 180
peer_credits 8
keepalive 30
sock_timeout 50
credits 256
rx_buffer_size 0
tx_buffer_size 0
keepalive_idle 30
round_robin 1
sock_timeout 50

Thanks,
Kevin

On Mon, Oct 30, 2017 at 4:16 PM, Mohr Jr, Richard Frank (Rick Mohr) <[email protected]> wrote:
>
> > On Oct 30, 2017, at 8:47 AM, Kevin M. Hildebrand <[email protected]> wrote:
> >
> > All of the hosts (client, server, router) have the following in ko2iblnd.conf:
> >
> > alias ko2iblnd-opa ko2iblnd
> > options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
> >
> > install ko2iblnd /usr/sbin/ko2iblnd-probe
>
> Those parameters will only get applied to omnipath interfaces (which you
> don’t have), so everything you have should just be running with default
> parameters. Since your lnet routers have a different version of lustre
> than your servers/clients, it might be possible that the default values for
> the ko2iblnd parameters are different between the two versions. You can
> always check this by looking at the values in the files under
> /sys/module/ko2iblnd/parameters. It might be worthwhile to compare those
> values on the lnet routers to the values on the servers to see if maybe
> there is a difference that could affect the behavior.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
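
The comparison suggested in the quoted message (looking at the files under /sys/module/ko2iblnd/parameters on each host) could be scripted roughly as follows. This is only a sketch: the function names (read_module_params, format_dump, parse_dump, diff_params) are made up for illustration, and it assumes each parameter is exposed as a small text file under /sys/module/<module>/parameters, as described above. The same approach should work for ksocklnd, assuming that module is loaded and exposes its parameters the same way.

```python
from pathlib import Path


def read_module_params(module, root="/sys/module"):
    """Read every parameter file for a kernel module into a dict.

    Assumes the layout <root>/<module>/parameters/<name>, one value per file.
    """
    param_dir = Path(root) / module / "parameters"
    params = {}
    for entry in sorted(param_dir.iterdir()):
        if not entry.is_file():
            continue
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters may not be readable; record that rather than fail.
            params[entry.name] = None
    return params


def format_dump(params):
    """Render parameters as 'name value' lines, suitable for saving per host."""
    return "\n".join(f"{name} {value or ''}".rstrip() for name, value in sorted(params.items()))


def parse_dump(text):
    """Parse 'name value' lines (as produced by format_dump) back into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.partition(" ")
        params[name] = value.strip()
    return params


def diff_params(a, b):
    """Return {name: (value_in_a, value_in_b)} for parameters that differ.

    A parameter missing on one side shows up with None for that side.
    """
    diffs = {}
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            diffs[name] = (a.get(name), b.get(name))
    return diffs
```

One way to use it: on each host, save the output of format_dump(read_module_params("ko2iblnd")) to a file, copy the files to one machine, and pass the parsed contents of the router and server files to diff_params. Any parameter it returns is one whose value differs between the two hosts.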
