I'm not sure what the problem could be, but you should try benchmarking 
your network bandwidth with lnet_selftest over both o2ib and tcp and compare 
the values. That will show whether the problem is related to the Lustre network 
layer or to something else.

http://wiki.lustre.org/LNET_Selftest
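
For reference, a minimal lnet_selftest run along those lines might look like the
sketch below. The NIDs 10.0.0.1@o2ib and 10.0.0.2@o2ib and the 1M transfer size
are just placeholders for your setup; repeat the same run with the corresponding
@tcp NIDs to get the comparison. It assumes the lnet_selftest module has been
loaded on every node taking part:

    modprobe lnet_selftest              # on every node involved in the test
    export LST_SESSION=$$               # lst uses this to identify the session
    lst new_session read_bench
    lst add_group servers 10.0.0.1@o2ib
    lst add_group clients 10.0.0.2@o2ib
    lst add_batch bulk_read
    lst add_test --batch bulk_read --from clients --to servers brw read size=1M
    lst run bulk_read
    lst stat clients servers &          # prints LNet-level throughput periodically
    sleep 30; kill %1
    lst end_session

If the read rate reported by lst stat already collapses on o2ib while tcp looks
fine, the problem sits in the LNet/LND layer rather than in the filesystem above it.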

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Christian Kuntz <c.ku...@opendrives.com>
Date: Wednesday, 12 February 2020 at 04:46
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Poor read performance when using o2ib nets over RoCE

Hello,

I've been running into a strange issue where my writes are blazingly fast (5.5 
GB/s) over RoCE with Mellanox MCX516A-CCAT cards all running together over 
o2ib, but read performance tanks to roughly 100 MB/s. During mixed read/write 
situations, write performance also plummets to below 100 MB/s.

Curiously, when using tcp these problems disappear and everyone is happy, 
hovering around 1.5 GB/s read, 3 GB/s write.

I'm wondering if anyone else has run into this and what the solution may be?
My setup is:
Debian 10.3, Lustre 2.13.0, ZFS 0.8.2, with two OSS/OST pairs, a single MGS/MDT 
node, and a single client node connected over o2ib. Everything is cabled together 
via 100G fiber through a Mellanox switch configured for RoCE and bonding, and all 
nodes hit about 98 Gb/s to each other via ib_send_bw. Simple network file transfer 
tests over NFSoRDMA didn't show the slowdown that Lustre seems to be seeing.

I'd be happy to provide more diagnostic information if that helps, as well as 
trace information if needed.

Best,
Christian

