Can you share the client’s CPT configuration?

$ lctl get_param cpu_partition_table cpu_partition_distance

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Gwen Dawes via lustre-discuss <lustre-discuss@lists.lustre.org>
Date: Wednesday, February 14, 2024 at 11:19 AM
To: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] LNet Multi-Rail config

Hi Chris,

Thanks for the pointers - apologies about kicking up an old thread, but I'm running out of ideas for how to solve this one.

I switched everything out to 2.15.4 and carefully documented the build process, then turned one of my machines into a VM host with PCI passthrough to eliminate any NUMA issues and additional complexity.

So now I have a much simpler layout - one client, multi-rail, four interfaces, with four servers (1 interface each), one of which is put aside for coordinating lnet_selftest runs.

Everything sees everything else as a Multi-Rail peer, with 1 interface for my servers and 4 for my client. My client has two CPTs set up - with the cards on each CPU set up into differing 'dev cpt' numbers - 0 and 1 - as per their bus.

Running an lnet_selftest of 'read', though (concurrency 32, simple check) - even with distribute 1:3 set, I always see exactly two interfaces in use. I can block them with net_drop_add, which eventually forces the traffic off to the others, but it only ever seems to use two interfaces.

Is this a bug with lnet_selftest? Some kind of non-network bottleneck? Am I misunderstanding CPTs?

Gwen

On Wed, 2024-01-17 at 17:53 +0000, Horn, Chris wrote:
> NRS only affects Lustre traffic, so it will not factor into
> lnet_selftest (LST) results.
>
> I gave some talks on troubleshooting multi-rail that you may want to
> review.
> Overview:
> https://youtu.be/j3m-mznUdac?feature=shared
> Demo:
> https://youtu.be/TLN56cw9Zgs?feature=shared
>
> You should probably start by verifying that the client and server see
> each other as multi-rail peers, and by checking the send and receive
> counts for each interface on your client and server to ensure that
> traffic is being spread across them.
>
> Chris Horn
>
> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on
> behalf of Gwen Dawes via lustre-discuss
> <lustre-discuss@lists.lustre.org>
> Date: Wednesday, January 17, 2024 at 5:48 AM
> To: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
> Subject: Re: [lustre-discuss] LNet Multi-Rail config - with BODY!
>
> Hi Andreas,
>
> Thanks for the pointer. I have a second server set up running 2.15.3
> as well specifically to check this, and can set it up with
> lnet_selftest, same as the client. After taking a bit to convince the
> fabric manager to accept the moved IPs, I get the exact same results
> between the two.
>
> Good to know that it is possible, though - I wonder what needs to be
> modified to achieve that. It's completely stock - the UDSP is just
> blank, and the default NRS config is in play.
>
> I don't suppose there's any chance the NRS config is what I'm
> missing?
>
> Gwen
>
> On Wed, 2024-01-17 at 03:14 +0000, Andreas Dilger wrote:
> > Hello Gwen,
> > I'm not a networking expert, but it seems entirely possible that
> > the MR discovery in 2.12.9 isn't doing as well as what is in 2.15.3
> > (or 2.15.4 for that matter). It would make more sense to have both
> > nodes running the same (newer) version before digging too deeply
> > into this.
> >
> > We have definitely seen performance > 1 IB interface from a single
> > node in our testing, though I can't say if that was done with
> > lnet_selftest or with something else.
> >
> > Cheers, Andreas
> >
> > > On Jan 16, 2024, at 08:14, Gwen Dawes via lustre-discuss
> > > <lustre-discuss@lists.lustre.org> wrote:
> > >
> > > Hi folks,
> > >
> > > Let's try that again.
> > >
> > > I'm in the luxury position of having four IB cards I'm trying to
> > > squeeze the most performance out of for Lustre I can.
> > >
> > > I have a small test setup - two machines - a client (2.12.9) and
> > > a server (2.15.3) with four IB cards each. I'm able to set them up
> > > as Multi-Rail and each one can discover the other as such.
> > > However, I can't seem to get lnet_selftest to give me more speed
> > > than a single interface, as reported by ib_send_bw.
> > >
> > > Am I missing some config here? Is LNet just not capable of doing
> > > more than one connection per NID?
> > >
> > > Gwen
> > > _______________________________________________
> > > lustre-discuss mailing list
> > > lustre-discuss@lists.lustre.org
> > > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Lustre Principal Architect
> > Whamcloud