While most of what has been said here is correct, it can't explain the performance of 200 files/sec, and I couldn't resist jumping in here :-D

 

Let's assume for a second that each operation is synchronous and done by just 1 thread. 200 files/sec means 5 ms on average per file write. Let's be generous and say the network layer costs 100 usec per round-trip network hop (including code processing on the protocol node or client), and for visualization let's assume the setup looks like this:

 

ESS Node ---ethernet--- Protocol Node ---ethernet--- Client Node

 

Let's say the ESS write cache can absorb small I/O at a fixed cost of 300 usec if the heads are ethernet connected and not using IB (with IB it would be more in the 250 usec range). That's 300 + 100 (net1) + 100 (net2) usec, or 500 usec in total, so you are a factor of 10 off from your number. Even if we assume a create + write is more than just 1 round trip worth of synchronization, say it needs 2 full round trips done synchronously, one for the create and one for the stable write, that's 1 ms, still 5x off of your 5 ms.
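
Just to spell the arithmetic out (these are the assumed costs from above, not measurements), a quick shell check:

echo $(( 2 * (300 + 100 + 100) ))               # 2 synchronous round trips -> 1000 usec per create+write
echo $(( 1000000 / (2 * (300 + 100 + 100)) ))   # -> ~1000 files/sec possible, vs. the observed 200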

 

So either there is a bug in the NFS server or the NFS client, or the storage is not behaving properly. To verify this, the best approach is to run the following test:

 

Create a file on the ESS node itself in the shared filesystem, like:

 

/usr/lpp/mmfs/samples/perf/gpfsperf create seq -nongpfs -r 4k -n 1m -th 1 -dio 
/sharedfs/test

 

Now run the following command on one of the ESS nodes, then on the protocol node, and last on the NFS client:

 

/usr/lpp/mmfs/samples/perf/gpfsperf write seq -nongpfs -r 4k -n 1m -th 1 -dio 
/sharedfs/test
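
On the ESS node and the protocol node the path above works as-is; on the NFS client, point the same command at the NFS mount of that file instead. The mount point below is just an example, adjust it to wherever the export is mounted on the client:

/usr/lpp/mmfs/samples/perf/gpfsperf write seq -nongpfs -r 4k -n 1m -th 1 -dio /mnt/sharedfs/test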

 

This will issue 256 stable 4k write I/Os to the storage system (1m / 4k = 256); I picked the number just to get a statistically relevant number of I/Os. You can change 1m to 2m or 4m, just don't make it too high or you might get variations due to de-staging or other side effects on the storage system, which you don't care about at this point; you want to see the round-trip time at each layer.

 

The gpfsperf command will spit out a line like:

 

Data rate was XYZ Kbytes/sec, Op Rate was XYZ Ops/sec, Avg Latency was 0.266 
milliseconds, thread utilization 1.000, bytesTransferred 1048576

 

The only number here that matters is the average latency number; write it down.
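
If you want to pull out just that number on each node, a simple grep over the output does it, for example:

/usr/lpp/mmfs/samples/perf/gpfsperf write seq -nongpfs -r 4k -n 1m -th 1 -dio /sharedfs/test | grep -o 'Avg Latency was [0-9.]* milliseconds'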

 

What I would expect to get back is something like:

 

On the ESS node: ~300 usec average I/O latency

On the protocol node: ~400 usec average I/O latency

On the client: ~500 usec average I/O latency

 

If you get anything higher than the numbers above, something fundamental is bad (in fact, on a fast system you may see no more than 200-300 usec response time from the client), and the problem will be in the layer in between, or below, where you test.

If all the numbers are roughly in line with my numbers above, it clearly points to a problem in NFS itself and the way it communicates with GPFS. Marc, myself and others have debugged numerous issues in this space in the past; the last one was fixed at the beginning of this year and ended up in some Scale 5.0.1.X release. Debugging this is very hard and most of the time only possible with GPFS source code access, which I no longer have.

 

You would start with something like strace -Ttt -f -o tar-debug.out tar -xvf ….. and check exactly which system calls are made against the NFS client and how long each takes. You would then run a similar strace on the NFS server to see how many individual system calls are made to GPFS and how long each takes. This will allow you to narrow down where the issue really is. But I suggest starting with the simpler test above, as it might already point to a much simpler problem.
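
As a rough sketch for summarizing that strace output (this assumes the -T duration suffix <...> at the end of each line in tar-debug.out and simply aggregates per syscall name, nothing more sophisticated):

awk 'match($0, /<[0-9.]+>$/) {
        dur = substr($0, RSTART + 1, RLENGTH - 2)        # duration from the -T <...> suffix
        if (match($0, /[a-z_][a-z_0-9]*\(/)) {           # first token that looks like a syscall name
            name = substr($0, RSTART, RLENGTH - 1)
            total[name] += dur; count[name]++
        }
    }
    END {
        for (s in total)
            printf "%-20s calls=%-7d total=%.6fs avg=%.6fs\n", s, count[s], total[s], total[s] / count[s]
    }' tar-debug.out

Alternatively, strace -c -f tar -xvf ….. prints a per-syscall time summary directly, just without the per-call detail.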

 

Btw, I will also be speaking at the UG meeting at SC18 in Dallas, in case somebody wants to catch up …

 

Sven

 

From: <[email protected]> on behalf of Jan-Frode 
Myklebust <[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Wednesday, October 17, 2018 at 6:50 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single 
thread, small files - native Scale vs NFS

 

Also beware there are 2 different Linux NFS "async" settings: a client side setting (mount -o async), which still causes a sync on file close() -- and a server (knfs) side setting (/etc/exports) that violates the NFS protocol and returns requests before data has hit stable storage.
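
To make that concrete, a minimal example of where each of the two knobs lives (the server name, export path and mount point here are made up):

# client side mount option - the client still flushes and commits on close()/fsync():
mount -t nfs -o rw,async,vers=3 protocolnode:/sharedfs /mnt/sharedfs

# knfs server side option, an /etc/exports entry - the server acknowledges writes before they are on stable storage:
/sharedfs  *(rw,async,no_subtree_check)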

 

 

  -jf

 

On Wed, Oct 17, 2018 at 9:41 AM Tomer Perry <[email protected]> wrote:

Hi,

Without going into too much detail, AFAIR, ONTAP integrates NVRAM into the NFS write cache (as it was developed as a NAS product).
ONTAP uses the STABLE bit, which kind of tells the client "hey, I have no write cache at all, everything is written to stable storage - thus, don't bother with commit (sync) commands - they are meaningless".
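
A quick way to see this from the Linux client side is to check whether the client issues COMMIT calls at all during the workload; nfsstat is standard on Linux, run it before and after the test and compare the commit counter:

nfsstat -c -3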


Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: [email protected]
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel:    +1 720 3422758
Israel Tel:      +972 3 9188625
Mobile:         +972 52 2554625




From:        "Keigo Matsubara" <[email protected]>
To:        gpfsug main discussion list <[email protected]>
Date:        17/10/2018 16:35
Subject:        Re: [gpfsug-discuss] Preliminary conclusion: single client, 
single thread, small files - native Scale vs NFS
Sent by:        [email protected]




I also wonder how many products actually exploit NFS async mode to improve I/O performance at the risk of sacrificing file system consistency:

[email protected] wrote on 2018/10/17 22:26:52:
>               Using this option usually improves performance, but at
> the cost that an unclean server restart (i.e. a crash) can cause 
> data to be lost or corrupted."

For instance, NetApp, at least the FAS 3220 running Data ONTAP 8.1.2p4 7-mode which I tested with, will forcibly *promote* async mode to sync mode. Promoting means that even if the NFS client requests the async mount mode, the NFS server ignores it and allows only the sync mount mode.

Best Regards,
---
Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
TEL: +81-50-3150-0595, T/L: 6205-0595