Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-21 Thread Malahal R Naineni
That is correct. There is an "async" mount option, which is related to Linux file systems; it has no effect at the NFS server, and the default is "async". The other is the "async" export option, which was invented to handle NFSv2, as the NFSv2 protocol doesn't have the notion of unstable writes and commits. People used the "async" export option with NFSv2, accepting the risk of losing data if the server goes down in exchange for performance.
 
NFSv3 has unstable writes and commits baked into the protocol, so there is no reason (well, sort of) to use the "async" export option with NFSv3. kNFS supports this option as it still supports NFSv2. Ganesha doesn't support NFSv2, so there is no need to support the "async" export option.
 
The Linux NFS client (as well as many other clients) uses something called "close-to-open" semantics, which forces it to COMMIT/SYNC all of a file's dirty data as part of close(). Since fclose() is just a wrapper, it doesn't matter whether the application uses close() or fclose(); the NFS client will SYNC the file's dirty data to the disk as part of the close() system call.
 
The "async" mount option will NOT fix this! I never experimented with the "nocto" mount option, but it might help here. See "man 5 nfs" for details.
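For anyone who wants to experiment, a test mount along these lines should exercise it (just a sketch; server name, export path and mount point are placeholders, and these options only change client-side behavior, not the server's commit semantics):

# hypothetical NFSv3 test mount with close-to-open cache coherence relaxed
mount -t nfs -o vers=3,nocto,async ces-node.example.com:/sharedfs /mnt/nfs-test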
 
Regards, Malahal.
PS: NFS 
- Original message -
From: Jan-Frode Myklebust
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug main discussion list
Cc:
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Date: Wed, Oct 17, 2018 7:20 PM
Also beware there are 2 different linux NFS "async" settings. A client side setting (mount -o async), which still causes sync on file close() -- and a server (knfs) side setting (/etc/exports) that violates NFS protocol and returns requests before data has hit stable storage.
 
 
  -jf 

On Wed, Oct 17, 2018 at 9:41 AM Tomer Perry  wrote:
Hi,

Without going into too much detail, AFAIR, Ontap integrates NVRAM into the NFS write cache (as it was developed as a NAS product). Ontap is using the STABLE bit, which kind of tells the client "hey, I have no write cache at all, everything is written to stable storage - thus, don't bother with commit (sync) commands - they are meaningless".

Regards,
Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: t...@il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625

From: "Keigo Matsubara"
To: gpfsug main discussion list
Date: 17/10/2018 16:35
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Sent by: gpfsug-discuss-boun...@spectrumscale.org

I also wonder how many products actually exploit NFS async mode to improve I/O performance at the risk of sacrificing file system consistency:

gpfsug-discuss-boun...@spectrumscale.org wrote on 2018/10/17 22:26:52:
> Using this option usually improves performance, but at
> the cost that an unclean server restart (i.e. a crash) can cause
> data to be lost or corrupted."

For instance, NetApp (at the very least the FAS 3220 running Data OnTap 8.1.2p4 7-mode, which I tested with) forcibly *promotes* async mode to sync mode. Promoting means that even if the NFS client requests the async mount mode, the NFS server ignores it and allows only the sync mount mode.

Best Regards,
---
Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
TEL: +81-50-3150-0595, T/L: 6205-0595

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
 

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Sven Oehme
While most of what was said here is correct, it can't explain the performance of 200 files
/sec, and I couldn't resist jumping in here :-D

 

Let's assume for a second that each operation is synchronous and it's done by just 1
thread. 200 files / sec means 5 ms on average per file write. Let's be generous
and say the network layer is 100 usec per round-trip network hop (including code
processing on the protocol node or client), and for visualization let's assume the
setup looks like this:

 

ESS Node ---ethernet--- Protocol Node ---ethernet--- Client Node.

 

Let's say the ESS write cache can absorb small I/O at a fixed cost of 300 usec if
the heads are ethernet connected and not using IB (then it would be more in the
250 usec range). That's 300 + 100 (net1) + 100 (net2) usec, or 500 usec in total. So
you are a factor of 10 off from your number. So let's just assume a create + write
is more than just 1 round trip worth of synchronization; let's say it needs to do
2 full round trips synchronously, one for the create and one for the stable write.
That's 1 ms, still 5x off of your 5 ms.
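To make the arithmetic explicit (same assumed numbers as above, nothing measured):

# 200 files/s observed vs. 2 synchronous round trips of (300 + 100 + 100) usec each
awk 'BEGIN {
  per_file = 1e6 / 200;              # observed budget: 5000 usec per file
  expected = 2 * (300 + 100 + 100);  # create + stable write: 1000 usec
  printf "observed %d usec/file vs. expected %d usec/file -> %.0fx gap\n", per_file, expected, per_file / expected
}'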

 

So either there is a bug in the NFS server, the NFS client, or the storage is
not behaving properly. To verify this, the best approach would be to run the following
test:

 

Create a file on the ESS node itself in the shared filesystem like : 

 

/usr/lpp/mmfs/samples/perf/gpfsperf create seq -nongpfs -r 4k -n 1m -th 1 -dio 
/sharedfs/test

 

Now run the following command on one of the ESS nodes, then the protocol node 
and last the nfs client : 

 

/usr/lpp/mmfs/samples/perf/gpfsperf write seq -nongpfs -r 4k -n 1m -th 1 -dio 
/sharedfs/test

 

This will issue 256 stable 4k write I/Os to the storage system. I picked the
number just to get a statistically relevant number of I/Os; you can change 1m to
2m or 4m, just don't make it too high or you might get variations due to
de-staging or other side effects happening on the storage system, which you
don't care about at this point, since you only want to see the round-trip time at each layer.

 

The gpfsperf command will spit out a line like :

 

Data rate was XYZ Kbytes/sec, Op Rate was XYZ Ops/sec, Avg Latency was 0.266 
milliseconds, thread utilization 1.000, bytesTransferred 1048576

 

The only number here that matters is the average latency; write it down.
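If it helps, the three runs can be scripted roughly like this (host names are placeholders, gpfsperf is assumed to be installed on each node, and on the NFS client the path has to be adjusted to its mount point):

for h in ess-node proto-node nfs-client; do
  echo -n "$h: "
  ssh $h /usr/lpp/mmfs/samples/perf/gpfsperf write seq -nongpfs -r 4k -n 1m -th 1 -dio /sharedfs/test | grep -i 'Avg Latency'
done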

 

What I would expect to get back is something like : 

 

On ESS Node – 300 usec average i/o 

On PN – 400 usec average i/o 

On Client – 500 usec average i/o 

 

If you get anything higher than the numbers above, something fundamental is wrong
(in fact, on a fast system you may see no more than 200-300 usec response time
from the client), and the problem will be in the layer in between or below where you
test.

If all the numbers are somewhere in line with my numbers above, it clearly
points to a problem in NFS itself and the way it communicates with GPFS. Marc,
myself and others have debugged numerous issues in this space in the past; the last
one was fixed at the beginning of this year and ended up in some Scale 5.0.1.X
release. Debugging this is very hard and most of the time only possible with
GPFS source code access, which I no longer have.

 

You would start with something like strace -Ttt -f -o tar-debug.out tar -xvf
…..  and check exactly which system calls are made by the NFS client and how long each
takes. You would then run a similar strace on the NFS server to see how many
individual system calls are made to GPFS and how long each takes. This will
allow you to narrow down where the issue really is. But I suggest starting with
the simpler test above, as it might already point to a much simpler problem.
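One rough way to boil the resulting trace down per syscall is something like the following (a sketch that relies on the <seconds> timings -T appends to each line; resumed/unfinished calls are not handled cleanly):

awk -F'[<>]' '/</ {
    split($1, a, "(");                  # text left of "(" ends with the syscall name
    name = a[1]; sub(/.* /, "", name);
    total[name] += $(NF-1); calls[name]++
  }
  END { for (s in total) printf "%-12s %6d calls %10.4f s\n", s, calls[s], total[s] }' tar-debug.out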

 

Btw. I will also be speaking at the UG Meeting at SC18 in Dallas, in case
somebody wants to catch up …

 

Sven

 

From:  on behalf of Jan-Frode 
Myklebust 
Reply-To: gpfsug main discussion list 
Date: Wednesday, October 17, 2018 at 6:50 AM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single 
thread, small files - native Scale vs NFS

 

Also beware there are 2 different linux NFS "async" settings. A client side 
setting (mount -o async), which still causes sync on file close() -- and a 
server (knfs) side setting (/etc/exports) that violates NFS protocol and 
returns requests before data has hit stable storage.

 

 

  -jf

 

On Wed, Oct 17, 2018 at 9:41 AM Tomer Perry  wrote:

Hi,

Without going into too much detail, AFAIR, Ontap integrates NVRAM into the NFS
write cache (as it was developed as a NAS product).
Ontap is using the STABLE bit, which kind of tells the client "hey, I have no
write cache at all, everything is written to stable storage - thus, don't
bother with commit (sync) commands - they are meaningless".


Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: t...@il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel:+1 720 3422758
Israel Tel:  +972 3 9188625
Mobile: +972 52 2554625




From:

Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Jan-Frode Myklebust
Also beware there are 2 different linux NFS "async" settings. A client side
setting (mount -o async), which still causes sync on file close() -- and a
server (knfs) side setting (/etc/exports) that violates NFS protocol and
returns requests before data has hit stable storage.


  -jf

On Wed, Oct 17, 2018 at 9:41 AM Tomer Perry  wrote:

> Hi,
>
> Without going into too much detail, AFAIR, Ontap integrates NVRAM into the
> NFS write cache (as it was developed as a NAS product).
> Ontap is using the STABLE bit, which kind of tells the client "hey, I have
> no write cache at all, everything is written to stable storage - thus,
> don't bother with commit (sync) commands - they are meaningless".
>
>
> Regards,
>
> Tomer Perry
> Scalable I/O Development (Spectrum Scale)
> email: t...@il.ibm.com
> 1 Azrieli Center, Tel Aviv 67021, Israel
> Global Tel:+1 720 3422758
> Israel Tel:  +972 3 9188625
> Mobile: +972 52 2554625
>
>
>
>
> From:"Keigo Matsubara" 
> To:    gpfsug main discussion list 
> Date:    17/10/2018 16:35
> Subject:        Re: [gpfsug-discuss] Preliminary conclusion: single
> client, single thread, small files - native Scale vs NFS
> Sent by:gpfsug-discuss-boun...@spectrumscale.org
> --
>
>
>
> I also wonder how many products actually exploit NFS async mode to improve
> I/O performance by sacrificing the file system consistency risk:
>
> gpfsug-discuss-boun...@spectrumscale.org wrote on 2018/10/17 22:26:52:
> >   Using this option usually improves performance, but at
> > the cost that an unclean server restart (i.e. a crash) can cause
> > data to be lost or corrupted."
>
> For instance, NetApp, at the very least FAS 3220 running Data OnTap
> 8.1.2p4 7-mode which I tested with, would forcibly *promote* async mode to
> sync mode.
> Promoting means even if NFS client requests async mount mode, the NFS
> server ignores and allows only sync mount mode.
>
> Best Regards,
> ---
> Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
> TEL: +81-50-3150-0595, T/L: 6205-0595
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Stijn De Weirdt
hi all,

has anyone tried to use tools like eatmydata that allow the user to
"ignore" the syncs? (there's another tool that has a less explicit name if
it would make you feel better ;)
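for reference, a typical invocation looks roughly like this (libeatmydata turns fsync()/fdatasync()/sync()/msync() from the wrapped process into no-ops; archive name and target are placeholders, and it of course won't change what the kernel NFS client itself does on close()):

eatmydata tar -xvf archive.tar -C /mnt/nfs-test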

stijn

On 10/17/2018 03:26 PM, Tomer Perry wrote:
> Just to clarify ( from man exports):
> "  async  This option allows the NFS server to violate the NFS protocol 
> and reply to requests before any changes made by that request have been 
> committed  to  stable  storage  (e.g.
>   disc drive).
> 
>   Using this option usually improves performance, but at the 
> cost that an unclean server restart (i.e. a crash) can cause data to be 
> lost or corrupted."
> 
> With the Ganesha implementation in Spectrum Scale, it was decided not to 
> allow this violation - so this async export option wasn't exposed.
> I believe that for those customers  that agree to take the risk, using 
> async mount option ( from the client) will achieve similar behavior.
> 
> Regards,
> 
> Tomer Perry
> Scalable I/O Development (Spectrum Scale)
> email: t...@il.ibm.com
> 1 Azrieli Center, Tel Aviv 67021, Israel
> Global Tel:+1 720 3422758
> Israel Tel:  +972 3 9188625
> Mobile: +972 52 2554625
> 
> 
> 
> 
> From:   "Olaf Weiser" 
> To:     gpfsug main discussion list 
> Date:   17/10/2018 16:16
> Subject:Re: [gpfsug-discuss] Preliminary conclusion: single 
> client, single thread, small files - native Scale vs NFS
> Sent by:gpfsug-discuss-boun...@spectrumscale.org
> 
> 
> 
> Jallo Jan, 
> you can expect slightly improved numbers from the lower response
> times of HAWC ... but the loss of performance comes from the fact that
> GPFS (or async kNFS) writes with multiple parallel threads, as opposed
> to e.g. tar via Ganesha NFS, which comes with a single thread and an fsync on each file.
>
>
> you'll never outperform e.g. 128 (maybe slower, but parallel) threads
> (running write-behind) with one single but fast thread,
>
> so as Alex suggests, if possible, use the GPFS client or kNFS for those
> types of workloads.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> From:    Jan-Frode Myklebust 
> To:gpfsug main discussion list 
> Date:10/17/2018 02:24 PM
> Subject:Re: [gpfsug-discuss] Preliminary conclusion: single 
> client, single thread, small files - native Scale vs NFS
> Sent by:gpfsug-discuss-boun...@spectrumscale.org
> 
> 
> 
> Do you know if the slow throughput is caused by the network/nfs-protocol 
> layer, or does it help to use faster storage (ssd)? If on storage, have 
> you considered if HAWC can help?
> 
> I'm thinking about adding an SSD pool as a first tier to hold the active 
> dataset for a similar setup, but that's mainly to solve the small file 
> read workload (i.e. random I/O ).
> 
> 
> -jf
> ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp <
> alexander.sa...@de.ibm.com>:
> Dear Mailing List readers,
> 
> I've come to a preliminary conclusion that explains the behavior in an 
> appropriate manner, so I'm trying to summarize my current thinking with 
> this audience.
> 
> Problem statement: 
> Big performance deviation between native GPFS (fast) and loopback NFS 
> mount on the same node (way slower) for single client, single thread, 
> small files workload.
> 
> 
> Current explanation:
> tar seems to use close() on files, not fclose(). That is an application 
> choice and common behavior. The idea is to allow OS write caching to 
> speed up process run time.
> 
> When running locally on ext3 / xfs / GPFS / .. that allows async destaging 
> of data down to disk, somewhat compromising data for better performance. 
> As we're talking about write caching on the same node that the application 
> runs on - a crash is missfortune but in the same failure domain.
> E.g. if you run a compile job that includes extraction of a tar and the 
> node crashes you'll have to restart the entire job, anyhow.
> 
> The NFSv2 spec defined that NFS io's are to be 'sync', probably because 
> the compile job on the nfs client would survive if the NFS Server crashes, 
> so the failure domain would be different
> 
> NFSv3 in rfc1813 below acknowledged the performance impact and introduced 
> the 'async' flag for NFS, which would handle IO's similar to local IOs, 
> allowing to destage in the background.
> 
> Keep in mind - applications, independent if running locally or via NFS can 
> always decided to use the fclose() option, which will ensure that data is 
> destaged to persistent storage right away.
> But its an applications choice if that's really mandatory or whet

Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Tomer Perry
Hi,

Without going into too much detail, AFAIR, Ontap integrates NVRAM into the
NFS write cache (as it was developed as a NAS product).
Ontap is using the STABLE bit, which kind of tells the client "hey, I have
no write cache at all, everything is written to stable storage - thus,
don't bother with commit (sync) commands - they are meaningless".


Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: t...@il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel:+1 720 3422758
Israel Tel:  +972 3 9188625
Mobile: +972 52 2554625




From:   "Keigo Matsubara" 
To: gpfsug main discussion list 
Date:   17/10/2018 16:35
Subject:    Re: [gpfsug-discuss] Preliminary conclusion: single 
client, single thread, small files - native Scale vs NFS
Sent by:gpfsug-discuss-boun...@spectrumscale.org



I also wonder how many products actually exploit NFS async mode to improve 
I/O performance by sacrificing the file system consistency risk:

gpfsug-discuss-boun...@spectrumscale.org wrote on 2018/10/17 22:26:52:
>   Using this option usually improves performance, but at
> the cost that an unclean server restart (i.e. a crash) can cause 
> data to be lost or corrupted."

For instance, NetApp, at the very least FAS 3220 running Data OnTap 
8.1.2p4 7-mode which I tested with, would forcibly *promote* async mode to 
sync mode.
Promoting means even if NFS client requests async mount mode, the NFS 
server ignores and allows only sync mount mode.

Best Regards,
---
Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
TEL: +81-50-3150-0595, T/L: 6205-0595
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Keigo Matsubara
I also wonder how many products actually exploit NFS async mode to improve
I/O performance at the risk of sacrificing file system consistency:

gpfsug-discuss-boun...@spectrumscale.org wrote on 2018/10/17 22:26:52:
>   Using this option usually improves performance, but at
> the cost that an unclean server restart (i.e. a crash) can cause 
> data to be lost or corrupted."

For instance, NetApp (at the very least the FAS 3220 running Data OnTap
8.1.2p4 7-mode, which I tested with) forcibly *promotes* async mode to
sync mode.
Promoting means that even if the NFS client requests the async mount mode, the NFS
server ignores it and allows only the sync mount mode.

Best Regards,
---
Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
TEL: +81-50-3150-0595, T/L: 6205-0595


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Jan-Frode Myklebust
My thinking was mainly that single threaded 200 files/second == 5 ms/file.
Where do these 5 ms go? Is it NFS protocol overhead, or is it waiting for
I/O so that it can be fixed with a lower latency storage backend?


 -jf

On Wed, Oct 17, 2018 at 9:15 AM Olaf Weiser  wrote:

> Jallo Jan,
> you can expect slightly improved numbers from the lower response
> times of HAWC ... but the loss of performance comes from the fact that
> GPFS (or async kNFS) writes with multiple parallel threads, as opposed
> to e.g. tar via Ganesha NFS, which comes with a single thread and an fsync on each file.
>
> you'll never outperform e.g. 128 (maybe slower, but parallel) threads
> (running write-behind) with one single but fast thread,
>
> so as Alex suggests, if possible, use the GPFS client or kNFS for those
> types of workloads.
>
>
>
>
>
>
>
>
>
>
> From:Jan-Frode Myklebust 
> To:gpfsug main discussion list 
> Date:    10/17/2018 02:24 PM
> Subject:    Re: [gpfsug-discuss] Preliminary conclusion: single
> client, single thread, small files - native Scale vs NFS
> Sent by:gpfsug-discuss-boun...@spectrumscale.org
> --
>
>
>
> Do you know if the slow throughput is caused by the network/nfs-protocol
> layer, or does it help to use faster storage (ssd)? If on storage, have you
> considered if HAWC can help?
>
> I’m thinking about adding an SSD pool as a first tier to hold the active
> dataset for a similar setup, but that’s mainly to solve the small file read
> workload (i.e. random I/O ).
>
>
> -jf
> ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp <
> *alexander.sa...@de.ibm.com* >:
> Dear Mailing List readers,
>
> I've come to a preliminary conclusion that explains the behavior in an
> appropriate manner, so I'm trying to summarize my current thinking with
> this audience.
>
> *Problem statement: *
> Big performance deviation between native GPFS (fast) and loopback NFS
> mount on the same node (way slower) for single client, single thread, small
> files workload.
>
>
> *Current explanation:*
> tar seems to use close() on files, not fclose(). That is an application
> choice and common behavior. The idea is to allow OS write caching to speed
> up process run time.
>
> When running locally on ext3 / xfs / GPFS / .. that allows async destaging
> of data down to disk, somewhat compromising data for better performance.
> As we're talking about write caching on the same node that the application
> runs on - a crash is a misfortune, but in the same failure domain.
> E.g. if you run a compile job that includes extraction of a tar and the
> node crashes you'll have to restart the entire job, anyhow.
>
> The NFSv2 spec defined that NFS io's are to be 'sync', probably because
> the compile job on the nfs client would survive if the NFS Server crashes,
> so the failure domain would be different
>
> NFSv3 in rfc1813 below acknowledged the performance impact and introduced
> the 'async' flag for NFS, which would handle IO's similar to local IOs,
> allowing to destage in the background.
>
> Keep in mind - applications, independent if running locally or via NFS can
> always decided to use the fclose() option, which will ensure that data is
> destaged to persistent storage right away.
> But its an applications choice if that's really mandatory or whether
> performance has higher priority.
>
> The linux 'sync' (man sync) tool allows to sync 'dirty' memory cache down
> to disk - very filesystem independent.
>
> -> single client, single thread, small files workload on GPFS can be
> destaged async, allowing to hide latency and parallelizing disk IOs.
> -> NFS client IO's are sync, so the second IO can only be started after
> the first one hit non volatile memory -> much higher latency
>
>
> The Spectrum Scale NFS implementation (based on ganesha) does not support
> the async mount option, which is a bit of a pitty. There might also be
> implementation differences compared to kernel-nfs, I did not investigate
> into that direction.
>
> However, the principles of the difference are explained for my by the
> above behavior.
>
> One workaround that I saw working well for multiple customers was to
> replace the NFS client by a Spectrum Scale nsd client.
> That has two advantages, but is certainly not suitable in all cases:
> - Improved speed by efficent NSD protocol and NSD client side write caching
> - Write Caching in the same failure domain as the application (on NSD
> client) which seems to be more reasonable compared to NFS Server side write
> caching.
>
> *References:*
>
> NFS sync vs async
> *https://tools.ietf.org/html/r

Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Tomer Perry
Just to clarify ( from man exports):
"  async  This option allows the NFS server to violate the NFS protocol 
and reply to requests before any changes made by that request have been 
committed  to  stable  storage  (e.g.
  disc drive).

  Using this option usually improves performance, but at the 
cost that an unclean server restart (i.e. a crash) can cause data to be 
lost or corrupted."

With the Ganesha implementation in Spectrum Scale, it was decided not to
allow this violation - so this async export option wasn't exposed.
I believe that for those customers that agree to take the risk, using the
async mount option (from the client) will achieve similar behavior.
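For comparison, this is roughly what the two kNFS variants look like in /etc/exports (host names and path are just examples):

/sharedfs  client1.example.com(rw,sync)    # protocol-compliant: reply only after data is on stable storage
/sharedfs  client2.example.com(rw,async)   # replies early; data can be lost on an unclean server restart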

Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: t...@il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel:+1 720 3422758
Israel Tel:  +972 3 9188625
Mobile: +972 52 2554625




From:   "Olaf Weiser" 
To: gpfsug main discussion list 
Date:   17/10/2018 16:16
Subject:    Re: [gpfsug-discuss] Preliminary conclusion: single 
client, single thread, small files - native Scale vs NFS
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Jallo Jan, 
you can expect slightly improved numbers from the lower response
times of HAWC ... but the loss of performance comes from the fact that
GPFS (or async kNFS) writes with multiple parallel threads, as opposed
to e.g. tar via Ganesha NFS, which comes with a single thread and an fsync on each file.


you'll never outperform e.g. 128 (maybe slower, but parallel) threads
(running write-behind) with one single but fast thread,

so as Alex suggests, if possible, use the GPFS client or kNFS for those
types of workloads.










From:Jan-Frode Myklebust 
To:gpfsug main discussion list 
Date:10/17/2018 02:24 PM
Subject:    Re: [gpfsug-discuss] Preliminary conclusion: single 
client, single thread, small files - native Scale vs NFS
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Do you know if the slow throughput is caused by the network/nfs-protocol 
layer, or does it help to use faster storage (ssd)? If on storage, have 
you considered if HAWC can help?

I'm thinking about adding an SSD pool as a first tier to hold the active 
dataset for a similar setup, but that's mainly to solve the small file 
read workload (i.e. random I/O ).


-jf
ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp <
alexander.sa...@de.ibm.com>:
Dear Mailing List readers,

I've come to a preliminary conclusion that explains the behavior in an 
appropriate manner, so I'm trying to summarize my current thinking with 
this audience.

Problem statement: 
Big performance deviation between native GPFS (fast) and loopback NFS 
mount on the same node (way slower) for single client, single thread, 
small files workload.


Current explanation:
tar seems to use close() on files, not fclose(). That is an application 
choice and common behavior. The idea is to allow OS write caching to 
speed up process run time.

When running locally on ext3 / xfs / GPFS / .. that allows async destaging 
of data down to disk, somewhat compromising data for better performance. 
As we're talking about write caching on the same node that the application 
runs on - a crash is a misfortune, but in the same failure domain.
E.g. if you run a compile job that includes extraction of a tar and the 
node crashes you'll have to restart the entire job, anyhow.

The NFSv2 spec defined that NFS io's are to be 'sync', probably because 
the compile job on the nfs client would survive if the NFS Server crashes, 
so the failure domain would be different

NFSv3 in rfc1813 below acknowledged the performance impact and introduced 
the 'async' flag for NFS, which would handle IO's similar to local IOs, 
allowing to destage in the background.

Keep in mind - applications, independent if running locally or via NFS can 
always decided to use the fclose() option, which will ensure that data is 
destaged to persistent storage right away.
But its an applications choice if that's really mandatory or whether 
performance has higher priority.

The linux 'sync' (man sync) tool allows to sync 'dirty' memory cache down 
to disk - very filesystem independent.

-> single client, single thread, small files workload on GPFS can be 
destaged async, allowing to hide latency and parallelizing disk IOs.
-> NFS client IO's are sync, so the second IO can only be started after 
the first one hit non volatile memory -> much higher latency


The Spectrum Scale NFS implementation (based on ganesha) does not support 
the async mount option, which is a bit of a pitty. There might also be 
implementation differences compared to kernel-nfs, I did not investigate 
into that direction.

However, the principles of the difference are explained for my by the 
above behavior. 

One workaround that I saw working well for multiple

Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Olaf Weiser
Jallo Jan,
you can expect slightly improved numbers from the lower response times of HAWC ... but the loss of performance comes from the fact that GPFS (or async kNFS) writes with multiple parallel threads, as opposed to e.g. tar via Ganesha NFS, which comes with a single thread and an fsync on each file.

You'll never outperform e.g. 128 (maybe slower, but parallel) threads running write-behind with one single but fast thread,

so as Alex suggests, if possible, use the GPFS client or kNFS for those types of workloads.

From: Jan-Frode Myklebust
To: gpfsug main discussion list
Date: 10/17/2018 02:24 PM
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Do you know if the slow throughput is caused by the network/nfs-protocol layer, or does it help to use faster storage (ssd)? If on storage, have you considered if HAWC can help?

I’m thinking about adding an SSD pool as a first tier to hold the active dataset for a similar setup, but that’s mainly to solve the small file read workload (i.e. random I/O).

-jf

ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp :

Dear Mailing List readers,

I've come to a preliminary conclusion that explains the behavior in an appropriate manner, so I'm trying to summarize my current thinking with this audience.

Problem statement:
Big performance deviation between native GPFS (fast) and loopback NFS mount on the same node (way slower) for single client, single thread, small files workload.

Current explanation:
tar seems to use close() on files, not fclose(). That is an application choice and common behavior. The idea is to allow OS write caching to speed up process run time.

When running locally on ext3 / xfs / GPFS / .. that allows async destaging of data down to disk, somewhat compromising data for better performance. As we're talking about write caching on the same node that the application runs on - a crash is a misfortune, but in the same failure domain.
E.g. if you run a compile job that includes extraction of a tar and the node crashes you'll have to restart the entire job, anyhow.

The NFSv2 spec defined that NFS io's are to be 'sync', probably because the compile job on the nfs client would survive if the NFS Server crashes, so the failure domain would be different.

NFSv3 in rfc1813 below acknowledged the performance impact and introduced the 'async' flag for NFS, which would handle IO's similar to local IOs, allowing to destage in the background.

Keep in mind - applications, independent of whether they run locally or via NFS, can always decide to use the fclose() option, which will ensure that data is destaged to persistent storage right away.
But it's an application's choice whether that's really mandatory or whether performance has higher priority.

The linux 'sync' (man sync) tool allows to sync 'dirty' memory cache down to disk - very filesystem independent.

-> single client, single thread, small files workload on GPFS can be destaged async, allowing to hide latency and parallelizing disk IOs.
-> NFS client IO's are sync, so the second IO can only be started after the first one hit non volatile memory -> much higher latency

The Spectrum Scale NFS implementation (based on ganesha) does not support the async mount option, which is a bit of a pity. There might also be implementation differences compared to kernel-nfs, I did not investigate in that direction.

However, the principles of the difference are explained for me by the above behavior.

One workaround that I saw working well for multiple customers was to replace the NFS client by a Spectrum Scale nsd client.
That has two advantages, but is certainly not suitable in all cases:
- Improved speed by the efficient NSD protocol and NSD client side write caching
- Write Caching in the same failure domain as the application (on NSD client) which seems to be more reasonable compared to NFS Server side write caching.

References:

NFS sync vs async
https://tools.ietf.org/html/rfc1813
The write throughput bottleneck caused by the synchronous definition of write in the NFS version 2 protocol has been addressed by adding support so that the NFS server can do unsafe writes.
Unsafe writes are writes which have not been committed to stable storage before the operation returns. This specification defines a method for committing these unsafe writes to stable storage in a reliable way.

sync() vs fsync()
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
- An application program makes an fsync() call for a specified file. This causes all of the pages that contain modified data for that file to be written to disk. The writing is complete when the fsync() call returns to the program.
- An application program makes a sync() call. This causes all of the file pages in memory that contain modified data to be scheduled for writing to disk. The 

Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Jan-Frode Myklebust
Do you know if the slow throughput is caused by the network/nfs-protocol
layer, or does it help to use faster storage (ssd)? If on storage, have you
considered if HAWC can help?

I’m thinking about adding an SSD pool as a first tier to hold the active
dataset for a similar setup, but that’s mainly to solve the small file read
workload (i.e. random I/O ).


-jf
ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp <
alexander.sa...@de.ibm.com>:

> Dear Mailing List readers,
>
> I've come to a preliminary conclusion that explains the behavior in an
> appropriate manner, so I'm trying to summarize my current thinking with
> this audience.
>
> *Problem statement: *
>
>Big performance deviation between native GPFS (fast) and loopback NFS
>mount on the same node (way slower) for single client, single thread, small
>files workload.
>
>
>
> *Current explanation:*
>
>tar seems to use close() on files, not fclose(). That is an
>application choice and common behavior. The idea is to allow OS write
>caching to speed up process run time.
>
>When running locally on ext3 / xfs / GPFS / .. that allows async
>destaging of data down to disk, somewhat compromising data for better
>performance.
>As we're talking about write caching on the same node that the
>application runs on - a crash is a misfortune, but in the same failure 
> domain.
>E.g. if you run a compile job that includes extraction of a tar and
>the node crashes you'll have to restart the entire job, anyhow.
>
>The NFSv2 spec defined that NFS io's are to be 'sync', probably
>because the compile job on the nfs client would survive if the NFS Server
>crashes, so the failure domain would be different
>
>NFSv3 in rfc1813 below acknowledged the performance impact and
>introduced the 'async' flag for NFS, which would handle IO's similar to
>local IOs, allowing to destage in the background.
>
>Keep in mind - applications, independent if running locally or via NFS
>can always decided to use the fclose() option, which will ensure that data
>is destaged to persistent storage right away.
>But its an applications choice if that's really mandatory or whether
>performance has higher priority.
>
>The linux 'sync' (man sync) tool allows to sync 'dirty' memory cache
>down to disk - very filesystem independent.
>
>
> -> single client, single thread, small files workload on GPFS can be
> destaged async, allowing to hide latency and parallelizing disk IOs.
> -> NFS client IO's are sync, so the second IO can only be started after
> the first one hit non volatile memory -> much higher latency
>
>
>
>The Spectrum Scale NFS implementation (based on ganesha) does not
>support the async mount option, which is a bit of a pitty. There might also
>be implementation differences compared to kernel-nfs, I did not investigate
>into that direction.
>
>However, the principles of the difference are explained for my by the
>above behavior.
>
>One workaround that I saw working well for multiple customers was to
>replace the NFS client by a Spectrum Scale nsd client.
>That has two advantages, but is certainly not suitable in all cases:
>   - Improved speed by efficent NSD protocol and NSD client side write
>   caching
>   - Write Caching in the same failure domain as the application (on
>   NSD client) which seems to be more reasonable compared to NFS Server 
> side
>   write caching.
>
>
> *References:*
>
> NFS sync vs async
> https://tools.ietf.org/html/rfc1813
> *The write throughput bottleneck caused by the synchronous definition of
> write in the NFS version 2 protocol has been addressed by adding support so
> that the NFS server can do unsafe writes.*
> Unsafe writes are writes which have not been committed to stable storage
> before the operation returns. This specification defines a method for
> committing these unsafe writes to stable storage in a reliable way.
>
>
> *sync() vs fsync()*
>
> https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
> - An application program makes an fsync() call for a specified file. This
> causes all of the pages that contain modified data for that file to be
> written to disk. The writing is complete when the fsync() call returns to
> the program.
>
> - An application program makes a sync() call. This causes all of the file
> pages in memory that contain modified data to be scheduled for writing to
> disk. The writing is not necessarily complete when the sync() call returns
> to the program.
>
> - A user can enter the sync command, which in turn issues a sync() call.
> Again, some of the writes may not be complete when the user is prompted for
> input (or the next command in a shell script is processed).
>
>
> *close() vs fclose()*
> A successful close does not guarantee that the data has been successfully
> saved to disk, as the kernel defers writes. 

[gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

2018-10-17 Thread Alexander Saupp


Dear Mailing List readers,

I've come to a preliminary conclusion that explains the behavior in an
appropriate manner, so I'm trying to summarize my current thinking with
this audience.

Problem statement:
    Big performance deviation between native GPFS (fast) and loopback NFS
   mount on the same node (way slower) for single client, single thread,
   small files workload.


Current explanation:
   tar seems to use close() on files, not fclose(). That is an application
    choice and common behavior. The idea is to allow OS write caching to
   speed up process run time.

   When running locally on ext3 / xfs / GPFS / .. that allows async
   destaging of data down to disk, somewhat compromising data for better
   performance.
   As we're talking about write caching on the same node that the
    application runs on - a crash is a misfortune, but in the same failure
   domain.
   E.g. if you run a compile job that includes extraction of a tar and the
   node crashes you'll have to restart the entire job, anyhow.

    The NFSv2 spec defined that NFS I/Os are to be 'sync', probably because
    the compile job on the NFS client would survive if the NFS server
    crashes, so the failure domain would be different.

   NFSv3 in rfc1813 below acknowledged the performance impact and
   introduced the 'async' flag for NFS, which would handle IO's similar to
   local IOs, allowing to destage in the background.

    Keep in mind - applications, independent of whether they run locally or via NFS,
    can always decide to use the fclose() option, which will ensure that
    data is destaged to persistent storage right away.
    But it's an application's choice whether that's really mandatory or whether
    performance has higher priority.

   The linux 'sync' (man sync) tool allows to sync 'dirty' memory cache
   down to disk - very filesystem independent.


-> single client, single thread, small files workload on GPFS can be
destaged async, allowing to hide latency and parallelizing disk IOs.
-> NFS client IO's are sync, so the second IO can only be started after the
first one hit non volatile memory -> much higher latency
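A crude way to see this effect directly is to time the same small-file loop against the GPFS path and the loopback NFS mount (paths are placeholders, GNU time is assumed; each file is written and closed once, so the NFS run pays the sync-on-close penalty per file):

for dir in /gpfs/fs0/tmptest /mnt/nfs-loopback/tmptest; do
  mkdir -p "$dir"
  /usr/bin/time -f "$dir: %e s" sh -c "for i in \$(seq 1 200); do dd if=/dev/zero of=$dir/f\$i bs=4k count=1 2>/dev/null; done"
done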


    The Spectrum Scale NFS implementation (based on Ganesha) does not
    support the async mount option, which is a bit of a pity. There might
    also be implementation differences compared to kernel-nfs; I did not
    investigate in that direction.

    However, the principles of the difference are explained for me by the
   above behavior.

   One workaround that I saw working well for multiple customers was to
   replace the NFS client by a Spectrum Scale nsd client.
   That has two advantages, but is certainly not suitable in all cases:
   - Improved speed by the efficient NSD protocol and NSD client side write
  caching
  - Write Caching in the same failure domain as the application (on NSD
  client) which seems to be more reasonable compared to NFS Server side
  write caching.


References:

NFS sync vs async
https://tools.ietf.org/html/rfc1813
The write throughput bottleneck caused by the synchronous definition
of write in the NFS version 2 protocol has been addressed by adding support
so that the NFS server can do unsafe writes.
Unsafe writes are writes which have not been committed to stable
storage before the operation returns.  This specification defines a method
for committing these unsafe writes to stable storage in a reliable way.


sync() vs fsync()

https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
- An application program makes an fsync() call for a specified file.
This causes all of the pages that contain modified data for that file to be
written to disk. The writing is complete when the fsync() call returns to
the program.

- An application program makes a sync() call. This causes all of the
file pages in memory that contain modified data to be scheduled for writing
to disk. The writing is not necessarily complete when the sync() call
returns to the program.

- A user can enter the sync command, which in turn issues a sync()
call. Again, some of the writes may not be complete when the user is
prompted for input (or the next command in a shell script is processed).


close() vs fclose()
A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel defers writes.  It is not common
for a file system to flush the buffers when the stream is closed.  If you
need to be  sure  that  the  data  is
physically stored use fsync(2).  (It will depend on the disk hardware
at this point.)
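To confirm the close()-without-fsync pattern for a given tar build, a syscall summary along these lines can be used (archive name and target directory are placeholders); fsync/fdatasync should typically show zero calls, while close shows up roughly once per extracted file:

strace -f -c -e trace=openat,write,close,fsync,fdatasync tar -xf archive.tar -C /some/target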


Mit freundlichen Grüßen / Kind regards

Alexander Saupp

IBM Systems, Storage Platform, EMEA Storage Competence Center