Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
That is correct. There is an "async" mount option, which is related to Linux file systems; it has no effect at the NFS server, and "async" is the default. The other is the "async" export option, which was invented to handle NFSv2, as the NFSv2 protocol has no notion of unstable writes and commits. People used the "async" export option with NFSv2 for performance, at the cost of losing data if the server goes down. NFSv3 has unstable writes and commits baked into the protocol, so there is (well, mostly) no reason to use the "async" export option with NFSv3. kNFS supports this option as it still supports NFSv2. Ganesha doesn't support NFSv2, so there is no need to support the "async" export option. Note that the Linux NFS client (like many clients) uses "close-to-open" semantics, which forces it to COMMIT/SYNC all of a file's dirty data as part of close(). Since fclose() is just a wrapper, it doesn't matter whether the application uses close() or fclose(): the NFS client will sync the file's dirty data to disk as part of the close() system call. The "async" mount option will NOT fix this! I never experimented with the "nocto" mount option, but it might help here. See "man 5 nfs" for details. Regards, Malahal. PS: NFS

- Original message -
From: Jan-Frode Myklebust
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug main discussion list
Cc:
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Date: Wed, Oct 17, 2018 7:20 PM

Also beware there are 2 different Linux NFS "async" settings. A client side setting (mount -o async), which still causes a sync on file close() -- and a server (knfs) side setting (/etc/exports) that violates the NFS protocol and returns requests before data has hit stable storage.
-jf

On Wed, Oct 17, 2018 at 9:41 AM Tomer Perry wrote:

Hi,
Without going into too much detail, AFAIR, Ontap integrates NVRAM into the NFS write cache (as it was developed as a NAS product).
Ontap is using the STABLE bit, which in effect tells the client "hey, I have no write cache at all, everything is written to stable storage - thus, don't bother with commit (sync) commands - they are meaningless".

Regards,
Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: t...@il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625

From: "Keigo Matsubara"
To: gpfsug main discussion list
Date: 17/10/2018 16:35
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Sent by: gpfsug-discuss-boun...@spectrumscale.org

I also wonder how many products actually exploit NFS async mode to improve I/O performance by accepting the risk to file system consistency:

gpfsug-discuss-boun...@spectrumscale.org wrote on 2018/10/17 22:26:52:
> Using this option usually improves performance, but at
> the cost that an unclean server restart (i.e.
a crash) can cause
> data to be lost or corrupted."

For instance, NetApp, at the very least a FAS 3220 running Data OnTap 8.1.2p4 7-mode which I tested with, would forcibly *promote* async mode to sync mode.
Promoting means that even if the NFS client requests an async mount, the NFS server ignores the request and allows only a sync mount.

Best Regards,
---
Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
TEL: +81-50-3150-0595, T/L: 6205-0595

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
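Malahal's point about close() forcing a COMMIT/SYNC of dirty data can be felt even without an NFS mount: forcing a per-file fsync() before close() on a local filesystem emulates roughly what the NFS client does on every close. A minimal sketch (local emulation only, not an actual NFS measurement; numbers will vary with the backing storage):

```python
import os
import tempfile
import time

def write_small_files(directory, count=100, size=4096, sync_each=False):
    """Create `count` small files; with sync_each=True, fsync each file
    before close(), roughly emulating NFS close-to-open semantics where
    dirty data must reach stable storage as part of close()."""
    payload = b"x" * size
    start = time.monotonic()
    for i in range(count):
        path = os.path.join(directory, f"f{i}")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, payload)
        if sync_each:
            os.fsync(fd)  # data must be durable before close() returns
        os.close(fd)
    return count / (time.monotonic() - start)  # files per second

with tempfile.TemporaryDirectory() as d:
    print(f"async-ish (no fsync): {write_small_files(d):.0f} files/s")
with tempfile.TemporaryDirectory() as d:
    print(f"sync-ish (fsync/file): {write_small_files(d, sync_each=True):.0f} files/s")
```

On a local disk the gap between the two runs gives a feel for how much of the small-file rate is pure sync latency rather than protocol overhead.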
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
While most of what has been said here is correct, it can't explain the performance of 200 files/sec, and I couldn't resist jumping in here :-D

Let's assume for a second that each operation is synchronous and done by just 1 thread. 200 files/sec means 5 ms on average per file write. Let's be generous and say the network layer costs 100 usec per round-trip network hop (including code processing on the protocol node or client), and for visualization let's assume the setup looks like this: ESS Node ---ethernet--- Protocol Node ---ethernet--- Client Node. Let's say the ESS write cache can absorb small i/o at a fixed cost of 300 usec if the heads are ethernet connected and not using IB (with IB it would be more in the 250 usec range). That's 300 + 100 (net1) + 100 (net2) usec, or 500 usec in total - so you are a factor of 10 off from your number. Now let's assume a create + write needs more than just 1 round trip worth of synchronization, say 2 full synchronous round trips, one for the create and one for the stable write. That's 1 ms, still 5x off from your 5 ms. So either there is a bug in the NFS server or the NFS client, or the storage is not behaving properly.

To verify this, the best approach is to run the following test. Create a file on the ESS node itself in the shared filesystem:

/usr/lpp/mmfs/samples/perf/gpfsperf create seq -nongpfs -r 4k -n 1m -th 1 -dio /sharedfs/test

Now run the following command first on one of the ESS nodes, then on the protocol node, and last on the NFS client:

/usr/lpp/mmfs/samples/perf/gpfsperf write seq -nongpfs -r 4k -n 1m -th 1 -dio /sharedfs/test

This issues 256 stable 4k write i/os to the storage system. I picked the number just to get a statistically relevant number of i/os; you can change 1m to 2m or 4m, just don't make it too high or you might get variations due to de-staging or other side effects on the storage system, which you don't care about at this point - you want to see the round-trip time at each layer.
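Sven's budget arithmetic above can be written out explicitly; a trivial check using exactly the numbers stated in the text (the 300 usec cache cost and 100 usec per hop are his stated assumptions, not measurements):

```python
# Back-of-the-envelope check of the latency budget described above.
files_per_sec = 200
observed_per_file_s = 1.0 / files_per_sec            # 5 ms per file

ess_write_cache_s = 300e-6   # ESS write cache absorb cost (ethernet heads)
net_hop_s = 100e-6           # per round-trip hop, incl. protocol processing

one_roundtrip_s = ess_write_cache_s + 2 * net_hop_s  # 500 usec total
two_roundtrips_s = 2 * one_roundtrip_s               # create + stable write

print(f"observed: {observed_per_file_s * 1e3:.1f} ms/file")
print(f"1 round trip: {one_roundtrip_s * 1e6:.0f} usec "
      f"-> {observed_per_file_s / one_roundtrip_s:.0f}x off")
print(f"2 round trips: {two_roundtrips_s * 1e6:.0f} usec "
      f"-> {observed_per_file_s / two_roundtrips_s:.0f}x off")
```

Even the pessimistic two-round-trip model leaves a 5x gap, which is the whole point of the argument: the missing milliseconds must be in a layer that misbehaves.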
The gpfsperf command will spit out a line like:

Data rate was XYZ Kbytes/sec, Op Rate was XYZ Ops/sec, Avg Latency was 0.266 milliseconds, thread utilization 1.000, bytesTransferred 1048576

The only number that matters here is the average latency; write it down. What I would expect to get back is something like:

On ESS node - 300 usec average i/o
On protocol node - 400 usec average i/o
On client - 500 usec average i/o

If you get anything higher than the numbers above, something fundamental is bad (in fact, on a fast system you may see no more than 200-300 usec response time from the client), and it will be in the layer in between or below the layer where you test. If all the numbers are roughly in line with my numbers above, it clearly points to a problem in NFS itself and the way it communicates with GPFS. Marc, myself and others have debugged numerous issues in this space in the past; the last one was fixed at the beginning of this year and ended up in some Scale 5.0.1.X release. Debugging this is very hard and most of the time only possible with GPFS source code access, which I no longer have. You would start with something like

strace -Ttt -f -o tar-debug.out tar -xvf …..

and check which exact system calls are made by the NFS client and how long each takes. You would then run a similar strace on the NFS server to see how many individual system calls are made to GPFS and how long each takes. This will allow you to narrow down where the issue really is. But I suggest starting with the simpler test above, as it might already point to a much simpler problem. Btw.
I will also be speaking at the UG meeting at SC18 in Dallas, in case somebody wants to catch up …

Sven

From: on behalf of Jan-Frode Myklebust
Reply-To: gpfsug main discussion list
Date: Wednesday, October 17, 2018 at 6:50 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

Also beware there are 2 different Linux NFS "async" settings. A client side setting (mount -o async), which still causes a sync on file close() -- and a server (knfs) side setting (/etc/exports) that violates the NFS protocol and returns requests before data has hit stable storage.

-jf

On Wed, Oct 17, 2018 at 9:41 AM Tomer Perry wrote:

Hi,

Without going into too much detail, AFAIR, Ontap integrates NVRAM into the NFS write cache (as it was developed as a NAS product).
Ontap is using the STABLE bit, which in effect tells the client "hey, I have no write cache at all, everything is written to stable storage - thus, don't bother with commit (sync) commands - they are meaningless".

Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: t...@il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625
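When comparing the three layers, the only figure to extract from each gpfsperf run is the average latency. A small helper for pulling it out of the output; the line format is assumed from the sample gpfsperf output quoted earlier in this thread (the data/op rate values in the sample string below are made up for illustration):

```python
import re

# The format is assumed from the sample gpfsperf output quoted above.
LINE_RE = re.compile(r"Avg Latency was ([0-9.]+) milliseconds")

def avg_latency_usec(gpfsperf_output: str) -> float:
    """Return the average latency in microseconds, or raise if absent."""
    m = LINE_RE.search(gpfsperf_output)
    if m is None:
        raise ValueError("no 'Avg Latency' line found in gpfsperf output")
    return float(m.group(1)) * 1000.0

# Sample line; the 0.266 ms value is from the output quoted in the thread,
# the other numbers are placeholders.
sample = ("Data rate was 15000 Kbytes/sec, Op Rate was 3759 Ops/sec, "
          "Avg Latency was 0.266 milliseconds, thread utilization 1.000, "
          "bytesTransferred 1048576")
print(f"{avg_latency_usec(sample):.1f} usec")
```

Running this against the output captured on the ESS node, the protocol node and the NFS client makes the layer-by-layer comparison mechanical.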
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Also beware there are 2 different Linux NFS "async" settings. A client side setting (mount -o async), which still causes a sync on file close() -- and a server (knfs) side setting (/etc/exports) that violates the NFS protocol and returns requests before data has hit stable storage.

-jf

On Wed, Oct 17, 2018 at 9:41 AM Tomer Perry wrote:
> Hi,
>
> Without going into too much detail, AFAIR, Ontap integrates NVRAM into the
> NFS write cache ( as it was developed as a NAS product).
> Ontap is using the STABLE bit which in effect tells the client "hey, I have
> no write cache at all, everything is written to stable storage - thus,
> don't bother with commit ( sync) commands - they are meaningless".
>
> Regards,
>
> Tomer Perry
> Scalable I/O Development (Spectrum Scale)
> email: t...@il.ibm.com
> 1 Azrieli Center, Tel Aviv 67021, Israel
> Global Tel: +1 720 3422758
> Israel Tel: +972 3 9188625
> Mobile: +972 52 2554625
>
> From: "Keigo Matsubara"
> To: gpfsug main discussion list
> Date: 17/10/2018 16:35
> Subject: Re: [gpfsug-discuss] Preliminary conclusion: single
> client, single thread, small files - native Scale vs NFS
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
> I also wonder how many products actually exploit NFS async mode to improve
> I/O performance by sacrificing file system consistency:
>
> gpfsug-discuss-boun...@spectrumscale.org wrote on 2018/10/17 22:26:52:
> > Using this option usually improves performance, but at
> > the cost that an unclean server restart (i.e. a crash) can cause
> > data to be lost or corrupted."
>
> For instance, NetApp, at the very least a FAS 3220 running Data OnTap
> 8.1.2p4 7-mode which I tested with, would forcibly *promote* async mode to
> sync mode.
> Promoting means even if the NFS client requests an async mount, the NFS
> server ignores it and allows only a sync mount.
> Best Regards,
> ---
> Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
> TEL: +81-50-3150-0595, T/L: 6205-0595
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
hi all,

has anyone tried tools like eatmydata, which allow the user to "ignore" the syncs (there's another tool with a less explicit name if that would make you feel better ;)?

stijn

On 10/17/2018 03:26 PM, Tomer Perry wrote:
> Just to clarify ( from man exports):
> " async This option allows the NFS server to violate the NFS protocol
> and reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the
> cost that an unclean server restart (i.e. a crash) can cause data to be
> lost or corrupted."
>
> With the Ganesha implementation in Spectrum Scale, it was decided not to
> allow this violation - so this async export option wasn't exposed.
> I believe that for those customers that agree to take the risk, using the
> async mount option ( from the client) will achieve similar behavior.
>
> Regards,
>
> Tomer Perry
> Scalable I/O Development (Spectrum Scale)
> email: t...@il.ibm.com
> 1 Azrieli Center, Tel Aviv 67021, Israel
> Global Tel: +1 720 3422758
> Israel Tel: +972 3 9188625
> Mobile: +972 52 2554625
>
> From: "Olaf Weiser"
> To: gpfsug main discussion list
> Date: 17/10/2018 16:16
> Subject: Re: [gpfsug-discuss] Preliminary conclusion: single
> client, single thread, small files - native Scale vs NFS
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
> Jallo Jan,
> you can expect slightly improved numbers from the lower response
> times of the HAWC ... but the loss of performance comes from the fact
> that GPFS (or async kNFS) writes with multiple parallel threads - in
> opposite to e.g. tar via Ganesha NFS, which comes with a single-threaded
> fsync on each file.
>
> You'll never outperform e.g. 128 parallel threads (running write-behind,
> maybe individually slower) with one single but fast thread,
>
> so as Alex suggests: if possible, use the gpfs client or kNFS for those
> types of workloads.
>
> From: Jan-Frode Myklebust
> To: gpfsug main discussion list
> Date: 10/17/2018 02:24 PM
> Subject: Re: [gpfsug-discuss] Preliminary conclusion: single
> client, single thread, small files - native Scale vs NFS
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
> Do you know if the slow throughput is caused by the network/nfs-protocol
> layer, or does it help to use faster storage (ssd)? If on storage, have
> you considered if HAWC can help?
>
> I'm thinking about adding an SSD pool as a first tier to hold the active
> dataset for a similar setup, but that's mainly to solve the small file
> read workload (i.e. random I/O).
>
> -jf
> ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp <
> alexander.sa...@de.ibm.com>:
> Dear Mailing List readers,
>
> I've come to a preliminary conclusion that explains the behavior in an
> appropriate manner, so I'm trying to summarize my current thinking with
> this audience.
>
> Problem statement:
> Big performance deviation between native GPFS (fast) and loopback NFS
> mount on the same node (way slower) for single client, single thread,
> small files workload.
>
> Current explanation:
> tar seems to use close() on files, not fclose(). That is an application
> choice and common behavior. The idea is to allow OS write caching to
> speed up process run time.
>
> When running locally on ext3 / xfs / GPFS / .. that allows async destaging
> of data down to disk, somewhat compromising data for better performance.
> As we're talking about write caching on the same node that the application
> runs on - a crash is a misfortune but in the same failure domain.
> E.g. if you run a compile job that includes extraction of a tar and the
> node crashes you'll have to restart the entire job, anyhow.
> The NFSv2 spec defined that NFS io's are to be 'sync', probably because
> the compile job on the nfs client would survive if the NFS server crashes,
> so the failure domain would be different.
>
> NFSv3 in rfc1813 below acknowledged the performance impact and introduced
> the 'async' flag for NFS, which would handle IO's similar to local IOs,
> allowing destaging in the background.
>
> Keep in mind - applications, independent of whether they run locally or
> via NFS, can always decide to use the fclose() option, which will ensure
> that data is destaged to persistent storage right away.
> But it's the application's choice whether that's really mandatory or
> whether performance has higher priority.
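On Stijn's eatmydata question above: that tool works by LD_PRELOADing a library that turns fsync() and friends into successful no-ops for the whole process. A rough Python analogue of the same idea, purely for illustration of the mechanism (and of why it is dangerous: the "synced" data is not actually durable):

```python
import os
import tempfile
import time
from contextlib import contextmanager

@contextmanager
def eat_my_data():
    """Temporarily turn os.fsync into a no-op, in the spirit of eatmydata
    (which does this process-wide via LD_PRELOAD for the libc calls)."""
    real_fsync = os.fsync
    os.fsync = lambda fd: None   # lie: report success without syncing
    try:
        yield
    finally:
        os.fsync = real_fsync    # always restore the real call

def tar_like_extract(directory, n=50):
    # Stand-in for a tar extraction: many small create+write+fsync+close.
    for i in range(n):
        with open(os.path.join(directory, f"f{i}"), "wb") as f:
            f.write(b"data")
            os.fsync(f.fileno())

with tempfile.TemporaryDirectory() as d:
    t0 = time.monotonic()
    tar_like_extract(d)
    honest = time.monotonic() - t0
with tempfile.TemporaryDirectory() as d:
    t0 = time.monotonic()
    with eat_my_data():
        tar_like_extract(d)
    eaten = time.monotonic() - t0
print(f"with fsync: {honest:.3f}s, fsync eaten: {eaten:.3f}s")
```

The gap between the two timings is the price of durability; eating the syncs simply moves the risk back into the page cache, which is exactly the trade-off the async export option makes server-side.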
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Hi,

Without going into too much detail, AFAIR, Ontap integrates NVRAM into the NFS write cache (as it was developed as a NAS product).
Ontap is using the STABLE bit, which in effect tells the client "hey, I have no write cache at all, everything is written to stable storage - thus, don't bother with commit (sync) commands - they are meaningless".

Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: t...@il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625

From: "Keigo Matsubara"
To: gpfsug main discussion list
Date: 17/10/2018 16:35
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Sent by: gpfsug-discuss-boun...@spectrumscale.org

I also wonder how many products actually exploit NFS async mode to improve I/O performance by sacrificing file system consistency:

gpfsug-discuss-boun...@spectrumscale.org wrote on 2018/10/17 22:26:52:
> Using this option usually improves performance, but at
> the cost that an unclean server restart (i.e. a crash) can cause
> data to be lost or corrupted."

For instance, NetApp, at the very least a FAS 3220 running Data OnTap 8.1.2p4 7-mode which I tested with, would forcibly *promote* async mode to sync mode.
Promoting means even if the NFS client requests an async mount, the NFS server ignores it and allows only a sync mount.

Best Regards,
---
Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
TEL: +81-50-3150-0595, T/L: 6205-0595
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
I also wonder how many products actually exploit NFS async mode to improve I/O performance by sacrificing file system consistency:

gpfsug-discuss-boun...@spectrumscale.org wrote on 2018/10/17 22:26:52:
> Using this option usually improves performance, but at
> the cost that an unclean server restart (i.e. a crash) can cause
> data to be lost or corrupted."

For instance, NetApp, at the very least a FAS 3220 running Data OnTap 8.1.2p4 7-mode which I tested with, would forcibly *promote* async mode to sync mode.
Promoting means even if the NFS client requests an async mount, the NFS server ignores it and allows only a sync mount.

Best Regards,
---
Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
TEL: +81-50-3150-0595, T/L: 6205-0595
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
My thinking was mainly that single threaded 200 files/second == 5 ms/file. Where do these 5 ms go? Is it NFS protocol overhead, or is it waiting for I/O, so that it can be fixed with a lower latency storage backend?

-jf

On Wed, Oct 17, 2018 at 9:15 AM Olaf Weiser wrote:
> Jallo Jan,
> you can expect slightly improved numbers from the lower response
> times of the HAWC ... but the loss of performance comes from the fact that
> GPFS (or async kNFS) writes with multiple parallel threads - in opposite
> to e.g. tar via Ganesha NFS, which comes with a single-threaded fsync on
> each file.
>
> You'll never outperform e.g. 128 parallel threads (running write-behind,
> maybe individually slower) with one single but fast thread,
>
> so as Alex suggests: if possible, use the gpfs client or kNFS for those
> types of workloads.
>
> From: Jan-Frode Myklebust
> To: gpfsug main discussion list
> Date: 10/17/2018 02:24 PM
> Subject: Re: [gpfsug-discuss] Preliminary conclusion: single
> client, single thread, small files - native Scale vs NFS
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
> Do you know if the slow throughput is caused by the network/nfs-protocol
> layer, or does it help to use faster storage (ssd)? If on storage, have you
> considered if HAWC can help?
>
> I’m thinking about adding an SSD pool as a first tier to hold the active
> dataset for a similar setup, but that’s mainly to solve the small file read
> workload (i.e. random I/O).
>
> -jf
> ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp <
> *alexander.sa...@de.ibm.com* >:
> Dear Mailing List readers,
>
> I've come to a preliminary conclusion that explains the behavior in an
> appropriate manner, so I'm trying to summarize my current thinking with
> this audience.
>
> *Problem statement: *
> Big performance deviation between native GPFS (fast) and loopback NFS
> mount on the same node (way slower) for single client, single thread, small
> files workload.
>
> *Current explanation:*
> tar seems to use close() on files, not fclose(). That is an application
> choice and common behavior. The idea is to allow OS write caching to speed
> up process run time.
>
> When running locally on ext3 / xfs / GPFS / .. that allows async destaging
> of data down to disk, somewhat compromising data for better performance.
> As we're talking about write caching on the same node that the application
> runs on - a crash is a misfortune but in the same failure domain.
> E.g. if you run a compile job that includes extraction of a tar and the
> node crashes you'll have to restart the entire job, anyhow.
>
> The NFSv2 spec defined that NFS io's are to be 'sync', probably because
> the compile job on the nfs client would survive if the NFS server crashes,
> so the failure domain would be different.
>
> NFSv3 in rfc1813 below acknowledged the performance impact and introduced
> the 'async' flag for NFS, which would handle IO's similar to local IOs,
> allowing destaging in the background.
>
> Keep in mind - applications, independent of whether they run locally or via
> NFS, can always decide to use the fclose() option, which will ensure that
> data is destaged to persistent storage right away.
> But it's the application's choice whether that's really mandatory or whether
> performance has higher priority.
>
> The linux 'sync' (man sync) tool allows syncing 'dirty' memory cache down
> to disk - very filesystem independent.
>
> -> single client, single thread, small files workload on GPFS can be
> destaged async, allowing to hide latency and parallelize disk IOs.
> -> NFS client IO's are sync, so the second IO can only be started after
> the first one hit non volatile memory -> much higher latency
>
> The Spectrum Scale NFS implementation (based on ganesha) does not support
> the async mount option, which is a bit of a pity. There might also be
> implementation differences compared to kernel-nfs; I did not investigate
> in that direction.
> However, the principles of the difference are explained for me by the
> above behavior.
>
> One workaround that I saw working well for multiple customers was to
> replace the NFS client by a Spectrum Scale nsd client.
> That has two advantages, but is certainly not suitable in all cases:
> - Improved speed by the efficient NSD protocol and NSD client side write
> caching
> - Write caching in the same failure domain as the application (on the NSD
> client), which seems more reasonable compared to NFS server side write
> caching.
>
> *References:*
>
> NFS sync vs async
> *https://tools.ietf.org/html/r
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Just to clarify (from man exports):
" async This option allows the NFS server to violate the NFS protocol and reply to requests before any changes made by that request have been committed to stable storage (e.g. disc drive).

Using this option usually improves performance, but at the cost that an unclean server restart (i.e. a crash) can cause data to be lost or corrupted."

With the Ganesha implementation in Spectrum Scale, it was decided not to allow this violation - so this async export option wasn't exposed.
I believe that for those customers that agree to take the risk, using the async mount option (from the client) will achieve similar behavior.

Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: t...@il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625

From: "Olaf Weiser"
To: gpfsug main discussion list
Date: 17/10/2018 16:16
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Jallo Jan,
you can expect slightly improved numbers from the lower response times of the HAWC ... but the loss of performance comes from the fact that GPFS (or async kNFS) writes with multiple parallel threads - in opposite to e.g. tar via Ganesha NFS, which comes with a single-threaded fsync on each file.

You'll never outperform e.g. 128 parallel threads (running write-behind, maybe individually slower) with one single but fast thread,

so as Alex suggests: if possible, use the gpfs client or kNFS for those types of workloads.
From: Jan-Frode Myklebust
To: gpfsug main discussion list
Date: 10/17/2018 02:24 PM
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Do you know if the slow throughput is caused by the network/nfs-protocol layer, or does it help to use faster storage (ssd)? If on storage, have you considered if HAWC can help?

I'm thinking about adding an SSD pool as a first tier to hold the active dataset for a similar setup, but that's mainly to solve the small file read workload (i.e. random I/O).

-jf

ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp < alexander.sa...@de.ibm.com>:

Dear Mailing List readers,

I've come to a preliminary conclusion that explains the behavior in an appropriate manner, so I'm trying to summarize my current thinking with this audience.

Problem statement:
Big performance deviation between native GPFS (fast) and loopback NFS mount on the same node (way slower) for single client, single thread, small files workload.

Current explanation:
tar seems to use close() on files, not fclose(). That is an application choice and common behavior. The idea is to allow OS write caching to speed up process run time.

When running locally on ext3 / xfs / GPFS / .. that allows async destaging of data down to disk, somewhat compromising data for better performance.
As we're talking about write caching on the same node that the application runs on - a crash is a misfortune but in the same failure domain.
E.g. if you run a compile job that includes extraction of a tar and the node crashes you'll have to restart the entire job, anyhow.
The NFSv2 spec defined that NFS io's are to be 'sync', probably because the compile job on the nfs client would survive if the NFS server crashes, so the failure domain would be different.

NFSv3 in rfc1813 below acknowledged the performance impact and introduced the 'async' flag for NFS, which would handle IO's similar to local IOs, allowing destaging in the background.

Keep in mind - applications, independent of whether they run locally or via NFS, can always decide to use the fclose() option, which will ensure that data is destaged to persistent storage right away.
But it's the application's choice whether that's really mandatory or whether performance has higher priority.

The linux 'sync' (man sync) tool allows syncing 'dirty' memory cache down to disk - very filesystem independent.

-> single client, single thread, small files workload on GPFS can be destaged async, allowing to hide latency and parallelize disk IOs.
-> NFS client IO's are sync, so the second IO can only be started after the first one hit non volatile memory -> much higher latency

The Spectrum Scale NFS implementation (based on ganesha) does not support the async mount option, which is a bit of a pity. There might also be implementation differences compared to kernel-nfs; I did not investigate in that direction.

However, the principles of the difference are explained for me by the above behavior.

One workaround that I saw working well for multiple customers was to replace the NFS client by a Spectrum Scale nsd client.
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Jallo Jan,
you can expect slightly improved numbers from the lower response times of the HAWC ... but the loss of performance comes from the fact that GPFS (or async kNFS) writes with multiple parallel threads - in opposite to e.g. tar via Ganesha NFS, which comes with a single-threaded fsync on each file.

You'll never outperform e.g. 128 parallel threads (running write-behind, maybe individually slower) with one single but fast thread,

so as Alex suggests: if possible, use the gpfs client or kNFS for those types of workloads.

From: Jan-Frode Myklebust
To: gpfsug main discussion list
Date: 10/17/2018 02:24 PM
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Do you know if the slow throughput is caused by the network/nfs-protocol layer, or does it help to use faster storage (ssd)? If on storage, have you considered if HAWC can help?

I'm thinking about adding an SSD pool as a first tier to hold the active dataset for a similar setup, but that's mainly to solve the small file read workload (i.e. random I/O).

-jf

ons. 17. okt. 2018 kl. 07:47 skrev Alexander Saupp:

Dear Mailing List readers,

I've come to a preliminary conclusion that explains the behavior in an appropriate manner, so I'm trying to summarize my current thinking with this audience.

Problem statement:
Big performance deviation between native GPFS (fast) and loopback NFS mount on the same node (way slower) for single client, single thread, small files workload.

Current explanation:
tar seems to use close() on files, not fclose(). That is an application choice and common behavior. The idea is to allow OS write caching to speed up process run time.

When running locally on ext3 / xfs / GPFS / .. that allows async destaging of data down to disk, somewhat compromising data for better performance.
As we're talking about write caching on the same node that the application runs on - a crash is a misfortune but in the same failure domain.
E.g. if you run a compile job that includes extraction of a tar and the node crashes you'll have to restart the entire job, anyhow.

The NFSv2 spec defined that NFS io's are to be 'sync', probably because the compile job on the nfs client would survive if the NFS server crashes, so the failure domain would be different.

NFSv3 in rfc1813 below acknowledged the performance impact and introduced the 'async' flag for NFS, which would handle IO's similar to local IOs, allowing destaging in the background.

Keep in mind - applications, independent of whether they run locally or via NFS, can always decide to use the fclose() option, which will ensure that data is destaged to persistent storage right away.
But it's the application's choice whether that's really mandatory or whether performance has higher priority.

The linux 'sync' (man sync) tool allows syncing 'dirty' memory cache down to disk - very filesystem independent.

-> single client, single thread, small files workload on GPFS can be destaged async, allowing to hide latency and parallelize disk IOs.
-> NFS client IO's are sync, so the second IO can only be started after the first one hit non volatile memory -> much higher latency

The Spectrum Scale NFS implementation (based on ganesha) does not support the async mount option, which is a bit of a pity. There might also be implementation differences compared to kernel-nfs; I did not investigate in that direction.

However, the principles of the difference are explained for me by the above behavior.
One workaround that I saw working well for multiple customers was to replace the NFS client by a Spectrum Scale nsd client.
That has two advantages, but is certainly not suitable in all cases:
- Improved speed by the efficient NSD protocol and NSD client side write caching
- Write caching in the same failure domain as the application (on the NSD client), which seems more reasonable compared to NFS server side write caching.

References:

NFS sync vs async
https://tools.ietf.org/html/rfc1813
The write throughput bottleneck caused by the synchronous definition of write in the NFS version 2 protocol has been addressed by adding support so that the NFS server can do unsafe writes.
Unsafe writes are writes which have not been committed to stable storage before the operation returns. This specification defines a method for committing these unsafe writes to stable storage in a reliable way.

sync() vs fsync()
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
- An application program makes an fsync() call for a specified file. This causes all of the pages that contain modified data for that file to be written to disk. The writing is complete when the fsync() call returns to the program.
- An application program makes a sync() call. This causes all of the file pages in memory that contain modified data to be scheduled for writing to disk. The writing is not necessarily complete when the sync() call returns to the program.
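The fsync() vs sync() distinction from the reference above maps directly onto os.fsync and os.sync; a minimal sketch of the per-file durable write pattern (Unix-only, since os.sync is a POSIX call):

```python
import os
import tempfile

# fsync(fd): flush one file's dirty pages; writing is complete on return.
# sync(): schedule ALL dirty pages system-wide; may return before I/O is done.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"important bytes")
    f.flush()              # user-space buffer -> kernel page cache
    os.fsync(f.fileno())   # page cache -> stable storage, this file only
os.sync()                  # schedule everything else that's still dirty
print("durable:", open(f.name, "rb").read())
os.unlink(f.name)
```

This is exactly the extra step that an application using plain close()/fclose() skips locally, and that the NFS client effectively performs on every close() under close-to-open semantics.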
Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Do you know if the slow throughput is caused by the network/NFS-protocol layer, or does it help to use faster storage (SSD)? If on storage, have you considered if HAWC can help?

I'm thinking about adding an SSD pool as a first tier to hold the active dataset for a similar setup, but that's mainly to solve the small-file read workload (i.e. random I/O).

  -jf

On Wed, Oct 17, 2018 at 07:47, Alexander Saupp <alexander.sa...@de.ibm.com> wrote:

> Dear Mailing List readers,
>
> I've come to a preliminary conclusion that explains the behavior in an
> appropriate manner, so I'm trying to summarize my current thinking with
> this audience.
>
> [rest of the quoted message trimmed; the original post appears in full below]
[gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Dear Mailing List readers,

I've come to a preliminary conclusion that explains the behavior in an appropriate manner, so I'm trying to summarize my current thinking with this audience.

Problem statement:

Big performance deviation between native GPFS (fast) and a loopback NFS mount on the same node (way slower) for a single-client, single-thread, small-files workload.

Current explanation:

tar seems to use close() on files, not fclose(). That is an application choice and common behavior. The idea is to allow OS write caching to speed up process run time.

When running locally on ext3 / xfs / GPFS / .. that allows async destaging of data down to disk, somewhat compromising data safety for better performance. As we're talking about write caching on the same node that the application runs on, a crash is a misfortune but stays in the same failure domain. E.g. if you run a compile job that includes extraction of a tar and the node crashes, you'll have to restart the entire job anyhow.

The NFSv2 spec defined that NFS I/Os are to be 'sync', probably because the compile job on the NFS client would survive if the NFS server crashes, so the failure domain would be different.

NFSv3 in RFC 1813 (below) acknowledged the performance impact and introduced the 'async' (unstable write) behavior for NFS, which handles I/Os similarly to local I/Os, allowing destaging in the background.

Keep in mind: applications, independent of whether they run locally or via NFS, can always decide to call fsync(), which ensures that data is destaged to persistent storage right away. But it's an application's choice whether that's really mandatory or whether performance has higher priority.

The Linux 'sync' tool (man sync) allows syncing 'dirty' memory cache down to disk, largely independent of the file system.

-> A single-client, single-thread, small-files workload on GPFS can be destaged asynchronously, allowing latency to be hidden and disk I/Os to be parallelized.
-> NFS client I/Os are sync, so the second I/O can only be started after the first one has hit non-volatile memory -> much higher latency.

The Spectrum Scale NFS implementation (based on Ganesha) does not support the async mount option, which is a bit of a pity. There might also be implementation differences compared to kernel NFS; I did not investigate in that direction. However, the principle of the difference is explained for me by the above behavior.

One workaround that I saw working well for multiple customers was to replace the NFS client by a Spectrum Scale NSD client. That has two advantages, but is certainly not suitable in all cases:
- Improved speed through the efficient NSD protocol and NSD client-side write caching
- Write caching in the same failure domain as the application (on the NSD client), which seems more reasonable than NFS server-side write caching

References:

NFS sync vs async
https://tools.ietf.org/html/rfc1813
The write throughput bottleneck caused by the synchronous definition of write in the NFS version 2 protocol has been addressed by adding support so that the NFS server can do unsafe writes. Unsafe writes are writes which have not been committed to stable storage before the operation returns. This specification defines a method for committing these unsafe writes to stable storage in a reliable way.

sync() vs fsync()
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
- An application program makes an fsync() call for a specified file. This causes all of the pages that contain modified data for that file to be written to disk. The writing is complete when the fsync() call returns to the program.
- An application program makes a sync() call. This causes all of the file pages in memory that contain modified data to be scheduled for writing to disk. The writing is not necessarily complete when the sync() call returns to the program.
- A user can enter the sync command, which in turn issues a sync() call. Again, some of the writes may not be complete when the user is prompted for input (or the next command in a shell script is processed).

close() vs fclose()
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored, use fsync(2). (It will depend on the disk hardware at this point.)

Mit freundlichen Grüßen / Kind regards

Alexander Saupp
IBM Systems, Storage Platform, EMEA Storage Competence Center