Let's say you have an RTT of 180 ms.
What you then need is your theoretical link speed - let's say 10 Gbit/s ... to keep it simple, call it 1 GB/s.

This means your socket must be able to absorb your full bandwidth (data stream) during the "first" 180 ms, because it will take at least that long for the first ACKs to come back.
So 1 GB/s x 0.180 s = 1024 MB/s x 0.180 s ==> ~185 MB. This means you have to allow the operating system to accept socket buffer sizes in that range.
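The same arithmetic as a quick sanity check on the shell (assumes bc is available):

# bandwidth-delay product: data in flight before the first ACK returns
echo "1024 * 0.180" | bc     # MB/s x s  ==>  ~184 MB, round up to 185 MB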

Set something like this, but increase these values to roughly 185 MB:
sysctl -w net.ipv4.tcp_rmem="12194304 12194304 12194304"                
sysctl -w net.ipv4.tcp_wmem="12194304 12194304 12194304"
sysctl -w net.ipv4.tcp_mem="12194304 12194304 12194304"
sysctl -w net.core.rmem_max=12194304
sysctl -w net.core.wmem_max=12194304
sysctl -w net.core.rmem_default=12194304
sysctl -w net.core.wmem_default=12194304
sysctl -w net.core.optmem_max=12194304
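For illustration only - the byte values below are my own arithmetic, not something I have tested on your system: 185 MB is 193986560 bytes, so the buffer settings above would end up roughly like this (keeping the kernel's usual min/default values, as in your current config). As an aside, net.ipv4.tcp_mem is counted in memory pages rather than bytes.

sysctl -w net.core.rmem_max=193986560
sysctl -w net.core.wmem_max=193986560
sysctl -w net.ipv4.tcp_rmem="4096 87380 193986560"
sysctl -w net.ipv4.tcp_wmem="4096 65536 193986560"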

In addition, set these:
sysctl -w net.core.netdev_max_backlog=50000
sysctl -w net.ipv4.tcp_no_metrics_save=1
sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.ipv4.tcp_max_syn_backlog=30000
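If you want these settings to survive a reboot, one way (the file name and values are just an example sketch) is a sysctl drop-in file:

cat > /etc/sysctl.d/90-wan-tuning.conf <<'EOF'
net.core.rmem_max = 193986560
net.core.wmem_max = 193986560
net.ipv4.tcp_rmem = 4096 87380 193986560
net.ipv4.tcp_wmem = 4096 65536 193986560
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
EOF
sysctl -p /etc/sysctl.d/90-wan-tuning.conf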


You need to "recycle" the sockets, i.e. restart GPFS on the affected nodes (mmshutdown/mmstartup), so the new buffer sizes take effect.
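For example, to restart GPFS on an affected node (the node name here is just a placeholder):

mmshutdown -N afmgw01    # afmgw01 = placeholder for your AFM gateway node
mmstartup -N afmgw01     # brings GPFS back up with the new TCP settings in place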

This should fix your issue.



Mit freundlichen Grüßen / Kind regards

 
Olaf Weiser

EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
IBM Allee 1
71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: olaf.wei...@de.ibm.com
-------------------------------------------------------------------------------------------------------------------------------------------




From:        Jan-Frode Myklebust <janfr...@tanso.net>
To:        "gpfsug-discuss@spectrumscale.org" <gpfsug-discuss@spectrumscale.org>
Date:        11/09/2016 07:05 PM
Subject:        Re: [gpfsug-discuss] Tuning AFM for high throughput/high IO over _really_ long distances
Sent by:        gpfsug-discuss-boun...@spectrumscale.org





Mostly curious, don't have experience in such environments, but ... Is this AFM over NFS or NSD protocol? Might be interesting to try the other option -- and also check how nsdperf performs over such distance/latency.



-jf

On Wed, 9 Nov 2016 at 18:39, Jake Carroll <jake.carr...@uq.edu.au> wrote:
Hi.

 

I've got a GPFS to GPFS AFM cache/home (IW) relationship set up over a really long distance: about 180 ms of latency between the two clusters and around 13,000 km of optical path. Fortunately for me, I've actually got near-theoretical-maximum IO over the NICs between the clusters, and I'm iPerf'ing at around 8.90 to 9.2 Gbit/sec over a 10GbE circuit. All MTU9000 all the way through.

 

Anyway – I'm finding my AFM traffic to be dragging its feet and I don't really understand why that might be. I've verified the link's and the transport's ability, as I said above, with iPerf and CERN's FDT to near 10 Gbit/sec.

 

I also verified the clusters on both sides in terms of disk IO, and in IOZone and IOR tests they both seem easily capable of multiple GB/sec of throughput.

 

So – my questions:

 

1.       Are there very specific tunings AFM needs for high latency/long distance IO?

2.       Are there very specific NIC/TCP-stack tunings (beyond the type of thing we already have in place) that benefit AFM over really long distances and high latency?

3.       We are seeing on the "cache" side really lazy/sticky "ls -als" in the home mount. It sometimes takes 20 to 30 seconds before the command line will report back with a long listing of files. Any ideas why it'd take that long to get a response from "home"?

 

We've got our TCP stack set up fairly aggressively on all hosts that participate in these two clusters.

 

ethtool -C enp2s0f0 adaptive-rx off

ifconfig enp2s0f0 txqueuelen 10000

sysctl -w net.core.rmem_max=536870912

sysctl -w net.core.wmem_max=536870912

sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"

sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"

sysctl -w net.core.netdev_max_backlog=250000

sysctl -w net.ipv4.tcp_congestion_control=htcp

sysctl -w net.ipv4.tcp_mtu_probing=1

 

I modified a couple of small things on the AFM “cache” side to see if it’d make a difference such as:

 

mmchconfig afmNumWriteThreads=4

mmchconfig afmNumReadThreads=4

 

But no difference so far.

 

Thoughts would be appreciated. I've done this before over much shorter distances (30 km) and I've flattened a 10GbE wire without really tuning…anything. Is the combination of a lot of data in flight and long acknowledgement times going to hurt here? I really thought AFM might be well designed for exactly this kind of work at long distance *and* high throughput – so I must be missing something!

 

-jc

 

 

 
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



