Re: [gpfsug-discuss] Tuning AFM for high throughput/high IO over _really_ long distances

2016-11-09 Thread Olaf Weiser
Let's say you have an RTT of 180 ms. What you then need is your theoretical link speed - let's say 10 Gbit/s ... easily, let's take 1 GB/s. This means your socket must be able to hold your bandwidth (data stream) during the "first" 180 ms, because it will take at least this time to get back the first ACKs. So 1 GB/s x 0.180 s = 1024 MB/s x 0.180 s, which is roughly 185 MB. This means you have to allow the operating system to accept socket sizes in that range. Set something like this - but increase these values to 185 MB:

sysctl -w net.ipv4.tcp_rmem="12194304 12194304 12194304"
sysctl -w net.ipv4.tcp_wmem="12194304 12194304 12194304"
sysctl -w net.ipv4.tcp_mem="12194304 12194304 12194304"
sysctl -w net.core.rmem_max=12194304
sysctl -w net.core.wmem_max=12194304
sysctl -w net.core.rmem_default=12194304
sysctl -w net.core.wmem_default=12194304
sysctl -w net.core.optmem_max=12194304

In addition, set these:

sysctl -w net.core.netdev_max_backlog=5
sysctl -w net.ipv4.tcp_no_metrics_save=1
sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.ipv4.tcp_max_syn_backlog=3

You then need to "recycle" the sockets, i.e. an mmshutdown/mmstartup should fix your issue.
Mit freundlichen Grüßen / Kind regards

Olaf Weiser
EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform
IBM Deutschland, IBM-Allee 1, 71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: olaf.wei...@de.ibm.com


From: Jan-Frode Myklebust
To: "gpfsug-discuss@spectrumscale.org"
Date: 11/09/2016 07:05 PM
Subject: Re: [gpfsug-discuss] Tuning AFM for high throughput/high IO over _really_ long distances
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Mostly curious, don't have experience in such environments, but ... Is this AFM over NFS or NSD protocol? Might be interesting to try the other option -- and also check how nsdperf performs over such distance/latency.

-jf

On Wed, 9 Nov 2016 at 18:39, Jake Carroll wrote:
Hi.

I've got a GPFS to GPFS AFM cache/home (IW) relationship set up over a really long distance: about 180 ms of latency between the two clusters and around 13,000 km of optical path. Fortunately for me, I've actually got near theoretical maximum IO over the NICs between the clusters, and I'm iPerf'ing at around 8.90 to 9.2 Gbit/sec over a 10GbE circuit. All MTU9000 all the way through.

Anyway – I'm finding my AFM traffic to be dragging its feet and I don't really understand why that might be. I've verified the links' and transports' ability, as I said above, with iPerf and CERN's FDT to near 10 Gbit/sec. I also verified the clusters on both sides in terms of disk IO, and they both seem easily capable of multiple GB/sec of throughput in IOZone and IOR tests.

So – my questions:

1. Are there very specific tunings AFM needs for high latency/long distance IO?

2. Are there very specific NIC/TCP-stack tunings (beyond the type of thing we already have in place) that benefit AFM over really long distances and high latency?

3. We are seeing on the "cache" side really lazy/sticky "ls -als" in the home mount. It sometimes takes 20 to 30 seconds before the command line will report back with a long listing of files. Any ideas why it'd take that long to get a response from "home"?

We've got our TCP stack set up fairly aggressively on all hosts that participate in these two clusters:

ethtool -C enp2s0f0 adaptive-rx off
ifconfig enp2s0f0 txqueuelen 1
sysctl -w net.core.rmem_max=536870912
sysctl -w net.core.wmem_max=536870912
sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
sysctl -w net.core.netdev_max_backlog=25
sysctl -w net.ipv4.tcp_congestion_control=htcp
sysctl -w net.ipv4.tcp_mtu_probing=1

I modified a couple of small things on the AFM "cache" side to see if it'd make a difference, such as:

mmchconfig afmNumWriteThreads=4
mmchconfig afmNumReadThreads=4

But no difference so far.

Thoughts would be appreciated. I've done this before over much shorter distances (30 km) and I've flattened a 10GbE wire without really tuning…anything. Are my large in-flight-packets/long-time-to-acknowledgement semantics going to hurt here? I really thought AFM might be well designed for exactly this kind of work at long distance *and* high throughput – so I must be missing something!

-jc
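Editorial aside, not part of the original question: for point 3, one quick check is whether the cache fileset's AFM gateway queue is backed up or the fileset is busy revalidating against home. A minimal sketch, assuming a filesystem named gpfs1 and an AFM fileset named cachefs (both placeholder names); verify the exact options against the mmafmctl and mmlsfileset man pages:

# Show AFM state and queue length for the cache fileset
mmafmctl gpfs1 getstate -j cachefs

# Show the fileset's AFM attributes (target, mode, async delay, ...)
mmlsfileset gpfs1 cachefs --afm -L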

Re: [gpfsug-discuss] Tuning AFM for high throughput/high IO over _really_ long distances (Jan-Frode Myklebust)

2016-11-09 Thread Scott Fadden
So you are using the NSD protocol for data transfers over multi-cluster? If so, the TCP and thread tuning should help as well.
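By way of illustration (an editorial sketch, not part of Scott's reply), the "thread tuning" on the NSD path is usually done with mmchconfig parameters like those below. The values and node-class names are placeholders, and parameter availability differs between releases, so check the documentation for your level before applying anything:

# Placeholder values and node classes -- adjust per release and hardware.
mmchconfig maxMBpS=10000 -N gatewayNodes         # per-node bandwidth hint used for prefetch/write-behind sizing
mmchconfig workerThreads=512 -N gatewayNodes     # overall worker thread pool (newer releases)
mmchconfig nsdMaxWorkerThreads=1024 -N nsdServerNodes   # NSD server side parallelism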


Scott Fadden
Spectrum Scale - Technical Marketing 
Phone: (503) 880-5833 
sfad...@us.ibm.com
http://www.ibm.com/systems/storage/spectrum/scale



From:   Jake Carroll <jake.carr...@uq.edu.au>
To: "gpfsug-discuss@spectrumscale.org" 
<gpfsug-discuss@spectrumscale.org>
Date:   11/09/2016 10:09 AM
Subject:    Re: [gpfsug-discuss] Tuning AFM for high throughput/high 
IO over _really_ long distances (Jan-Frode Myklebust)
Sent by: gpfsug-discuss-boun...@spectrumscale.org



Hi jf…

 
> Mostly curious, don't have experience in such environments, but ... Is this
> AFM over NFS or NSD protocol? Might be interesting to try the other option
> -- and also check how nsdperf performs over such distance/latency.
 
As it turns out, it seems, very few people do. 

I will test nsdperf over it and see how it performs. And yes, it is AFM → 
AFM. No NFS involved here!

-jc


 
Re: [gpfsug-discuss] Tuning AFM for high throughput/high IO over _really_ long distances (Jan-Frode Myklebust)

2016-11-09 Thread Jake Carroll
Hi jf…


> Mostly curious, don't have experience in such environments, but ... Is this
> AFM over NFS or NSD protocol? Might be interesting to try the other option
> -- and also check how nsdperf performs over such distance/latency.

As it turns out, it seems, very few people do. 

I will test nsdperf over it and see how it performs. And yes, it is AFM → AFM. 
No NFS involved here!

-jc
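For anyone wanting to reproduce the nsdperf test over the same path: the tool ships as source with Spectrum Scale under /usr/lpp/mmfs/samples/net and is driven interactively. The outline below is an editorial sketch from memory rather than a verified transcript, so check the comment header of nsdperf.C for the exact compile line and command set:

# Compile the sample (the header of nsdperf.C lists the supported compile line)
cd /usr/lpp/mmfs/samples/net
g++ -O2 -o nsdperf -lpthread -lrt nsdperf.C

# On a node at "home": run in server mode
./nsdperf -s

# On a node at "cache": run interactively, declare server/client nodes, then test.
# Command names (server/client/test/quit) are from memory -- verify in nsdperf.C.
./nsdperf
#   server home-node.example.org
#   client cache-node.example.org
#   test
#   quit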




Re: [gpfsug-discuss] Tuning AFM for high throughput/high IO over _really_ long distances

2016-11-09 Thread Scott Fadden
Jake,

If AFM is using NFS, it is all about NFS tuning. The copy from one side to the other is basically just a client writing to an NFS mount. There are a few things you can look at (see the sketch after this list):
1. NFS transfer size (make it 1 MiB, I think that is the max).
2. TCP tuning for a large window size. This is discussed under "Tuning active file management home communications" in the docs. On that page you will also find some discussion of increasing gateway threads, and other similar things that may help.
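To make those two items concrete (an editorial sketch, not from Scott's message): the NFS transfer size is the rsize/wsize negotiated against the home export, and the gateway parallelism is a per-fileset AFM attribute. The filesystem and fileset names below are placeholders; verify the parameter names for your release before using them:

# 1. Check that the home export negotiates 1 MiB transfers. AFM's gateway manages its
#    own NFS mount, so this manual mount is only a test of the export from a spare node.
mount -t nfs -o rsize=1048576,wsize=1048576 home-cluster:/gpfs/homefs /mnt/home-test
grep home-test /proc/mounts     # confirm the effective rsize/wsize

# 2. More flush (gateway) threads for the cache fileset ("gpfs1"/"cachefs" are placeholders).
mmchfileset gpfs1 cachefs -p afmNumFlushThreads=8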

We can discuss further as I understand we will be meeting at SC16.

Scott Fadden
Spectrum Scale - Technical Marketing 
Phone: (503) 880-5833 
sfad...@us.ibm.com
http://www.ibm.com/systems/storage/spectrum/scale



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss