Thanks for the pointers. Flow control has limited impact at this point (no change under lnet_selftest, and a ~10% drop when disabled under iperf). All machines have tcp_sack enabled. Checksums don't seem to make a difference either. Bumping up max_rpcs_in_flight didn't improve much, but it seems to have made the write speed more consistent. read_ahead had no effect on read performance.
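For what it's worth, here is how the utilisation percentages quoted below work out; the reading of "link capacity" as the OSS's 2x100GbE bond counted full duplex (both directions combined) is my assumption:

```python
# Sanity check on the utilisation figures, assuming "link capacity" means
# the OSS's 2x100GbE bond counted full duplex (my assumption, not stated
# explicitly in the thread).
link_gbit = 2 * 100 * 2        # 2 links x 100 Gbit/s x 2 directions
link_gbyte = link_gbit / 8     # -> 50 GB/s

iperf_gbyte = 43.7             # combined iperf bandwidth
lnet_gbyte = 14.0              # lnet_selftest peak

print(f"iperf:         {iperf_gbyte / link_gbyte:.0%}")   # ~87%, i.e. ~90% of capacity
print(f"lnet_selftest: {lnet_gbyte / link_gbyte:.0%}")    # 28%
```

Under that assumption both quoted percentages are consistent with each other.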
At this point I am struggling to understand what actually affects reads. iperf between the clients and the OSS gives a combined bandwidth that reaches ~90% of link capacity (43.7GB/s), but lnet_selftest maxes out at ~14GB/s, so about 28%. Any clues on what lnet tunables/settings could have an impact here?

Best regards,
Louis

On 13/08/2019 12:53, Raj wrote:
Louis, I would also try:
- Turning on selective ack (net.ipv4.tcp_sack=1) on all nodes. This helps, although there is a CVE out there for older kernels.
- Turning off checksums (osc.ostid*.checksums). This can be turned off per OST/FS on clients.
- Increasing max_pages_per_rpc to 16M. Although this may not help with your reads.
- Increasing max_rpcs_in_flight, and setting max_dirty_mb to 2 x max_rpcs_in_flight.
- Increasing llite.ostid*.max_read_ahead_mb up to 1024 on clients. Again, this can be set per OST/FS.
_Raj

On Mon, Aug 12, 2019 at 12:12 PM Shawn Hall <[email protected]> wrote:
Do you have Ethernet flow control configured on all ports (especially the uplink ports)? We've found that flow control is critical when there are mismatched uplink/client port speeds.
Shawn

From: lustre-discuss <[email protected]> On Behalf Of Louis Bailleul
Sent: Monday, August 12, 2019 1:08 PM
To: [email protected]
Subject: [lustre-discuss] Very bad lnet ethernet read performance

Hi all,

I am trying to understand what I am doing wrong here. I have a Lustre 2.12.1 system backed by NVMe drives under ZFS, for which obdfilter-survey gives decent values:

ost 2 sz 536870912K rsz 1024K obj 2 thr 256 write 15267.49 [6580.36, 8664.20] rewrite 15225.24 [6559.05, 8900.54] read 19739.86 [9062.25, 10429.04]

But my actual Lustre performance is pretty poor in comparison (can't top 8GB/s write and 13.5GB/s read). So I started to question my lnet tuning, but playing with peer_credits and max_pages_per_rpc didn't help.
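In case it helps others reproduce: peer_credits is a ksocklnd module parameter, so it has to be set at module load time, while max_pages_per_rpc is a runtime osc parameter on the clients. A sketch of what I mean (the values are just the ones I tried, not recommendations):

```
# /etc/modprobe.d/ksocklnd.conf -- sketch; values are examples only
options ksocklnd peer_credits=128 credits=1024

# max_pages_per_rpc is set at runtime on the clients, e.g.:
#   lctl set_param osc.*.max_pages_per_rpc=4096
```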
My test setup consists of 133x10G Ethernet clients (uplinks between end devices and OSS are 2x100G for every 20 nodes). The single OSS is fitted with a bond of 2x100G Ethernet.

I have tried to understand the problem using lnet_selftest, but I'll need some help/doco as this doesn't make sense to me.

Testing a single 10G client:

[LNet Rates of lfrom]
[R] Avg: 2231 RPC/s Min: 2231 RPC/s Max: 2231 RPC/s
[W] Avg: 1156 RPC/s Min: 1156 RPC/s Max: 1156 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 1075.16 MiB/s Min: 1075.16 MiB/s Max: 1075.16 MiB/s
[W] Avg: 0.18 MiB/s Min: 0.18 MiB/s Max: 0.18 MiB/s
[LNet Rates of lto]
[R] Avg: 1179 RPC/s Min: 1179 RPC/s Max: 1179 RPC/s
[W] Avg: 2254 RPC/s Min: 2254 RPC/s Max: 2254 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.19 MiB/s Min: 0.19 MiB/s Max: 0.19 MiB/s
[W] Avg: 1075.17 MiB/s Min: 1075.17 MiB/s Max: 1075.17 MiB/s

With 10x10G clients:

[LNet Rates of lfrom]
[R] Avg: 1416 RPC/s Min: 1102 RPC/s Max: 1642 RPC/s
[W] Avg: 708 RPC/s Min: 551 RPC/s Max: 821 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 708.20 MiB/s Min: 550.77 MiB/s Max: 820.96 MiB/s
[W] Avg: 0.11 MiB/s Min: 0.08 MiB/s Max: 0.13 MiB/s
[LNet Rates of lto]
[R] Avg: 7084 RPC/s Min: 7084 RPC/s Max: 7084 RPC/s
[W] Avg: 14165 RPC/s Min: 14165 RPC/s Max: 14165 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.08 MiB/s Min: 1.08 MiB/s Max: 1.08 MiB/s
[W] Avg: 7081.86 MiB/s Min: 7081.86 MiB/s Max: 7081.86 MiB/s

With all 133x10G clients:

[LNet Rates of lfrom]
[R] Avg: 510 RPC/s Min: 98 RPC/s Max: 23457 RPC/s
[W] Avg: 510 RPC/s Min: 49 RPC/s Max: 45863 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 169.87 MiB/s Min: 48.77 MiB/s Max: 341.26 MiB/s
[W] Avg: 169.86 MiB/s Min: 0.01 MiB/s Max: 22757.92 MiB/s
[LNet Rates of lto]
[R] Avg: 23458 RPC/s Min: 23458 RPC/s Max: 23458 RPC/s
[W] Avg: 45876 RPC/s Min: 45876 RPC/s Max: 45876 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 341.12 MiB/s Min: 341.12 MiB/s Max: 341.12 MiB/s
[W] Avg: 22758.42 MiB/s Min: 22758.42 MiB/s Max: 22758.42 MiB/s

So if I add clients, the aggregate
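For reference, the numbers above come from an lnet_selftest run along these lines (a sketch only: the NIDs and group names are placeholders for my actual ones, and it obviously only runs on the Lustre nodes themselves):

```
#!/bin/bash
# lnet_selftest sketch: bulk read test from a group of clients to the OSS.
# NIDs below are placeholders, not my real addresses.
export LST_SESSION=$$
lst new_session read_test
lst add_group clients 192.168.1.[10-20]@tcp
lst add_group server 192.168.1.1@tcp
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to server brw read size=1M
lst run bulk_read
lst stat clients server & sleep 30; kill $!
lst end_session
```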
write bandwidth somewhat stacks, but the read bandwidth decreases???

When throwing all the nodes at the system, I am pretty happy with the ~22GB/s write, as this is within 90% of the 2x100G, but the 341MB/s read sounds very weird considering that this is a third of the performance of a single client.

These are my ksocklnd tunings:

# for i in /sys/module/ksocklnd/parameters/*; do echo "$i : $(cat $i)"; done
/sys/module/ksocklnd/parameters/credits : 1024
/sys/module/ksocklnd/parameters/eager_ack : 0
/sys/module/ksocklnd/parameters/enable_csum : 0
/sys/module/ksocklnd/parameters/enable_irq_affinity : 0
/sys/module/ksocklnd/parameters/inject_csum_error : 0
/sys/module/ksocklnd/parameters/keepalive : 30
/sys/module/ksocklnd/parameters/keepalive_count : 5
/sys/module/ksocklnd/parameters/keepalive_idle : 30
/sys/module/ksocklnd/parameters/keepalive_intvl : 5
/sys/module/ksocklnd/parameters/max_reconnectms : 60000
/sys/module/ksocklnd/parameters/min_bulk : 1024
/sys/module/ksocklnd/parameters/min_reconnectms : 1000
/sys/module/ksocklnd/parameters/nagle : 0
/sys/module/ksocklnd/parameters/nconnds : 4
/sys/module/ksocklnd/parameters/nconnds_max : 64
/sys/module/ksocklnd/parameters/nonblk_zcack : 1
/sys/module/ksocklnd/parameters/nscheds : 12
/sys/module/ksocklnd/parameters/peer_buffer_credits : 0
/sys/module/ksocklnd/parameters/peer_credits : 128
/sys/module/ksocklnd/parameters/peer_timeout : 180
/sys/module/ksocklnd/parameters/round_robin : 1
/sys/module/ksocklnd/parameters/rx_buffer_size : 0
/sys/module/ksocklnd/parameters/sock_timeout : 50
/sys/module/ksocklnd/parameters/tx_buffer_size : 0
/sys/module/ksocklnd/parameters/typed_conns : 1
/sys/module/ksocklnd/parameters/zc_min_payload : 16384
/sys/module/ksocklnd/parameters/zc_recv : 0
/sys/module/ksocklnd/parameters/zc_recv_min_nfrags : 16

Best regards,
Louis
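For completeness, the client-side knobs suggested earlier in the thread map to commands along these lines (a sketch only: the wildcards would be narrowed to the relevant OST/FS in practice, and the max_rpcs_in_flight / max_dirty_mb values are illustrative, not recommendations):

```
# Sketch of the client-side settings suggested by Raj (illustrative values).
sysctl -w net.ipv4.tcp_sack=1                  # selective ack
lctl set_param osc.*.checksums=0               # disable checksums per OSC
lctl set_param osc.*.max_pages_per_rpc=4096    # 16MB RPCs
lctl set_param osc.*.max_rpcs_in_flight=64
lctl set_param osc.*.max_dirty_mb=128          # 2 x max_rpcs_in_flight
lctl set_param llite.*.max_read_ahead_mb=1024
```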
_______________________________________________ lustre-discuss mailing list [email protected] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
