On 9/28/17 8:29 PM, Dilger, Andreas wrote:
> Riccardo,
> I'm not an LNet expert, but a number of LNet multi-rail fixes have landed or
> are being worked on for Lustre 2.10.1. You might try testing the current
> b2_10 branch to see if that resolves your problems.

You are right, I might end up doing that. Sorry, but I did not understand
whether 2.10.1 is officially out or still a release candidate.

thanks

> Cheers, Andreas
>
> On Sep 27, 2017, at 21:22, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it> wrote:
>> Hello,
>>
>> I configured multi-rail in my Lustre environment:
>>
>> MDS:    172.21.42.213@tcp
>> OSS:    172.21.52.118@o2ib
>>         172.21.52.86@o2ib
>> Client: 172.21.52.124@o2ib
>>         172.21.52.125@o2ib
>>
>> [root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
>> nid                  refs state  last  max  rtr  min   tx  min queue
>> 172.21.52.124@o2ib      1    NA    -1  128  128  128  128  128     0
>> 172.21.52.125@o2ib      1    NA    -1  128  128  128  128  128     0
>> 172.21.42.213@tcp       1    NA    -1    8    8    8    8    6     0
>>
>> After configuring multi-rail I can see the peers for both InfiniBand
>> interfaces on the OSS and on the client side.
>> Before multi-rail, the Lustre client could mount the Lustre filesystem
>> without problems. Now that multi-rail is set up, the client can no longer
>> mount the filesystem.
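(The peer state read from /proc/sys/lnet/peers above can also be inspected through the lnetctl utility that ships with Lustre 2.10; a minimal sketch, assuming lnetctl is installed and the lnet module is loaded:)

```shell
# Dump the configured peers, including their Multi-Rail flag, as YAML
lnetctl peer show --verbose

# Export the complete running LNet configuration (nets, peers, routes)
lnetctl export
```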
>>
>> When I mount Lustre from the client (fstab entry):
>>
>> 172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
>>
>> the filesystem cannot be mounted and I get these errors:
>>
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842861] Lustre:
>> 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
>> failed due to network error: [sent 1506562126/real 1506562126]
>> req@ffff8808326b2a00 x1579744801849904/t0(0)
>> o400->drplu-OST0001-osc-ffff88085d134800@172.21.52.86@o2ib:28/4
>> lens 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842872] Lustre:
>> drplu-OST0001-osc-ffff88085d134800: Connection to drplu-OST0001 (at
>> 172.21.52.86@o2ib) was lost; in progress operations using this service
>> will wait for recovery to complete
>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.843306] Lustre:
>> drplu-OST0001-osc-ffff88085d134800: Connection restored to
>> 172.21.52.86@o2ib (at 172.21.52.86@o2ib)
>>
>> The mount point appears in and disappears from "df" every few seconds.
>>
>> I do not have a clue how to fix this. The multi-rail capability is
>> important for me.
>>
>> I have Lustre 2.10.0 on both the client side and the server side.
>> Here is my lnet.conf on the Lustre client side. The one on the OSS side
>> is similar, just with the peers swapped for the o2ib net.
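(One way to narrow down whether the multi-rail peer entry itself causes the mount failure is to drop it at runtime and retry the mount; a sketch using the OSS NIDs from this thread, assuming the 2.10 lnetctl peer add/del syntax, run on the client:)

```shell
# Remove the second OSS NID from the multi-rail peer (reverts the peer
# to single-rail behaviour for testing)
lnetctl peer del --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib

# If the mount then succeeds, re-add the NID to restore multi-rail
lnetctl peer add --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib
```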
>>
>> net:
>>     - net type: lo
>>       local NI(s):
>>         - nid: 0@lo
>>           status: up
>>           statistics:
>>               send_count: 0
>>               recv_count: 0
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 0
>>               peer_credits: 0
>>               peer_buffer_credits: 0
>>               credits: 0
>>           lnd tunables:
>>           tcp bonding: 0
>>           dev cpt: 0
>>           CPT: "[0]"
>>     - net type: o2ib
>>       local NI(s):
>>         - nid: 172.21.52.124@o2ib
>>           status: up
>>           interfaces:
>>               0: ib0
>>           statistics:
>>               send_count: 7
>>               recv_count: 7
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 128
>>               peer_buffer_credits: 0
>>               credits: 1024
>>           lnd tunables:
>>               peercredits_hiw: 64
>>               map_on_demand: 32
>>               concurrent_sends: 256
>>               fmr_pool_size: 2048
>>               fmr_flush_trigger: 512
>>               fmr_cache: 1
>>               ntx: 2048
>>               conns_per_peer: 4
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>>         - nid: 172.21.52.125@o2ib
>>           status: up
>>           interfaces:
>>               0: ib1
>>           statistics:
>>               send_count: 5
>>               recv_count: 5
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 128
>>               peer_buffer_credits: 0
>>               credits: 1024
>>           lnd tunables:
>>               peercredits_hiw: 64
>>               map_on_demand: 32
>>               concurrent_sends: 256
>>               fmr_pool_size: 2048
>>               fmr_flush_trigger: 512
>>               fmr_cache: 1
>>               ntx: 2048
>>               conns_per_peer: 4
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>>     - net type: tcp
>>       local NI(s):
>>         - nid: 172.21.42.195@tcp
>>           status: up
>>           interfaces:
>>               0: enp7s0f0
>>           statistics:
>>               send_count: 51
>>               recv_count: 51
>>               drop_count: 0
>>           tunables:
>>               peer_timeout: 180
>>               peer_credits: 8
>>               peer_buffer_credits: 0
>>               credits: 256
>>           lnd tunables:
>>           tcp bonding: 0
>>           dev cpt: -1
>>           CPT: "[0]"
>> peer:
>>     - primary nid: 172.21.42.213@tcp
>>       Multi-Rail: False
>>       peer ni:
>>         - nid: 172.21.42.213@tcp
>>           state: NA
>>           max_ni_tx_credits: 8
>>           available_tx_credits: 8
>>           min_tx_credits: 6
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 8
>>           min_rtr_credits: 8
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
>>     - primary nid: 172.21.52.86@o2ib
>>       Multi-Rail: True
>>       peer ni:
>>         - nid: 172.21.52.86@o2ib
>>           state: NA
>>           max_ni_tx_credits: 128
>>           available_tx_credits: 128
>>           min_tx_credits: 128
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 128
>>           min_rtr_credits: 128
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
>>         - nid: 172.21.52.118@o2ib
>>           state: NA
>>           max_ni_tx_credits: 128
>>           available_tx_credits: 128
>>           min_tx_credits: 128
>>           tx_q_num_of_buf: 0
>>           available_rtr_credits: 128
>>           min_rtr_credits: 128
>>           send_count: 0
>>           recv_count: 0
>>           drop_count: 0
>>           refcount: 1
>>
>> Thank you very much for any hint you may give.
>>
>> Rick
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
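(For reference, the peer section of the export above is the same YAML layout that `lnetctl import` accepts; a minimal multi-rail peer fragment built from the OSS NIDs in this thread, with the statistics fields omitted, might look like the following sketch:)

```yaml
peer:
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
        - nid: 172.21.52.118@o2ib
```

(Loaded with `lnetctl import < peers.yaml`; the equivalent command form would be `lnetctl peer add --prim_nid 172.21.52.86@o2ib --nid 172.21.52.118@o2ib`.)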
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org