On Jul 5, 2024, at 11:37, Michael DiDomenico via lustre-discuss <[email protected]> wrote:
> i could use a little help with lustre clients over omni path. when i run
> ib_write_bw tests between two compute nodes i get 10+ GB/sec. compute nodes
> are rhel9.4 with rhel hw drivers. however, when i run lnet_selftest between
> the same two compute nodes (1m i/o size, 16 concurrency):
>
>   node1-node3: read 1m i/o ~7.1GB/sec, write 1m i/o ~4.7GB/sec
>   node3-node1: read 1m i/o ~6.6GB/sec, write 1m i/o ~4.9GB/sec
>
> varying the i/o size and concurrency changes the numbers, but not
> dramatically. i've gone through the tuning guide for omnipath and my lnd
> tunables all match, but i can't seem to drive the bandwidth any higher
> between nodes.

Please provide the actual tuning parameters in use. Even when we were part of
Intel, the OPA tuning parameters suggested by the OPA team were not
necessarily the best in all cases. There was some kind of memory registration
they kept suggesting, but it was always worse in practice than in theory.

The biggest win was from setting conns_per_peer=4 or so, because OPA needs
more CPU resources for good performance than IB. That said, it has been
several years since I've had to deal with it, so I can't say if your current
performance is good or bad.

> can anyone suggest where i might be dropping some performance or is this
> the end? i feel like there should be more performance here, but since we
> recently retooled from rhel7 to rhel9, i'm unsure if there's a tunable not
> tuned. (unfortunately i don't have/can't seem to find previous numbers to
> compare)

Cheers, Andreas

--
Andreas Dilger
Lustre Principal Architect
Whamcloud
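P.S. As a starting point for reporting "the actual tuning parameters in use",
a minimal sketch of dumping the o2ib LND tunables in effect and persisting
conns_per_peer. It assumes a Lustre version recent enough that ko2iblnd
exposes the conns_per_peer module parameter; the config file path is just an
example.

    # show the LND tunables LNet is actually running with
    lnetctl net show --net o2ib --verbose

    # check the current conns_per_peer value (recent ko2iblnd only)
    cat /sys/module/ko2iblnd/parameters/conns_per_peer

    # persist the setting, e.g. in /etc/modprobe.d/ko2iblnd.conf:
    #   options ko2iblnd conns_per_peer=4
    # then unload and reload the modules so it takes effect:
    #   lustre_rmmod; modprobe lnet; lnetctl lnet configure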
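And, so results can be compared like-for-like, a minimal lnet_selftest sketch
matching the 1M / 16-concurrency test described above; the NIDs, group names,
and 30-second runtime are placeholders.

    # on both nodes first: modprobe lnet_selftest
    export LST_SESSION=$$
    lst new_session opa_bw
    lst add_group node1 10.0.0.1@o2ib        # placeholder NID
    lst add_group node3 10.0.0.3@o2ib        # placeholder NID
    lst add_batch bulk
    lst add_test --batch bulk --from node1 --to node3 \
        --concurrency 16 brw write size=1M
    lst run bulk
    lst stat node1 node3 &                   # prints bandwidth once per second
    sleep 30; kill %1
    lst stop bulk
    lst end_session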
