Re: [ceph-users] CephFS+NFS For VMWare

2018-07-02 Thread David C
On Sat, 30 Jun 2018, 21:48 Nick Fisk,  wrote:

> Hi Paul,
>
>
>
> Thanks for your response. Is there anything you can go into more detail on
> and share with the list? I’m sure it would be much appreciated by more than
> just myself.
>
> I was planning on kernel CephFS and the kernel NFS server; both seem to
> achieve better performance, although stability is of greater concern.
>
>
FWIW, a recent nfs-ganesha could be more stable than kernel NFS. I've had a
fair few issues with knfsd exporting CephFS: it works fine until there is an
issue with your cluster, such as an MDS going down or slow requests, and you
can end up with your nfsd processes stuck in the dreaded uninterruptible sleep.

Also consider CTDB for basic active/active NFS on CephFS. It works fine for
normal Linux clients, but I'm not sure how well it would work with ESX. If you
want to use CTDB with Ganesha, I think you're restricted to using the plain
VFS FSAL; I don't think the Ceph FSAL will give you the consistent file
handles you need for client failover to work properly (although I could be
wrong there).
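
For reference, here's a minimal nfs-ganesha export sketch using the Ceph
FSAL; the Export_Id, Pseudo path and cephx user ("ganesha") are illustrative
assumptions, not a tested config:

EXPORT {
    Export_Id = 1;            # must be unique per export
    Path = "/";               # path within CephFS to export
    Pseudo = "/cephfs";       # NFSv4 pseudo-root for the export
    Access_Type = RW;
    Squash = No_Root_Squash;
    Protocols = 3, 4;

    FSAL {
        Name = CEPH;          # "VFS" instead, if exporting a kernel CephFS
                              # mount (the CTDB-friendly option above)
        User_Id = "ganesha";  # cephx user, assumed to already exist
    }
}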



> Thanks,
>
> Nick
>
> [remainder of quoted thread trimmed]

Re: [ceph-users] CephFS+NFS For VMWare

2018-07-02 Thread Maged Mokhtar
Hi Nick, 

With iSCSI we reach over 150 MB/s vMotion for a single VM, and 1 GB/s for 7-8
concurrent VM migrations. Since these are 64KB block sizes, latency/IOPS is a
large factor: you need either controllers with write-back cache or all flash.
HDDs without write cache will suffer even with external WAL/DB on SSDs,
giving around 80 MB/s vMotion migration. Potentially it may be possible to
get higher vMotion speeds by using fancy striping, but I would not recommend
this unless the total queue depth across all your VMs is small compared to
the number of OSDs.

Regarding thin provisioning, a VMDK provisioned as lazy zeroed does have an
"initial" large impact on random write performance; it could be up to 10x
slower. If you write a random 64KB to an unallocated VMFS block, VMFS will
first write 1MB to fill the block with zeros and then write the 64KB of
client data, so although a lot of data is being written, the perceived client
bandwidth is very low. Performance gradually improves over time until the
disk is fully provisioned. It is also possible to thick eager zero the VMDK
at creation time. Again, this is more apparent with random writes than with
sequential or vMotion load.
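
A quick back-of-envelope sketch of that amplification in Python; the 1MB
VMFS block size is from the description above, and the backend throughput is
an illustrative assumption:

# Random 64KB write into an unallocated, lazy-zeroed VMFS block:
# VMFS zero-fills the whole 1MB block first, then writes the client data.
client_io = 64 * 1024           # bytes the guest actually wrote
zero_fill = 1024 * 1024         # bytes VMFS writes to initialise the block

amplification = (client_io + zero_fill) / client_io
print(f"write amplification: {amplification:.0f}x")          # -> 17x

backend_bw = 100 * 1024 * 1024  # assumed backend throughput, bytes/s
perceived = backend_bw / amplification
print(f"perceived bandwidth: {perceived / 2**20:.1f} MB/s")  # -> ~5.9 MB/s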

Maged 

On 2018-06-29 18:48, Nick Fisk wrote:

> [quoted text trimmed; see the original message below]


Re: [ceph-users] CephFS+NFS For VMWare

2018-07-02 Thread Paul Emmerich
Hi,

we've used kernel CephFS + kernel NFS in the past. It works reasonably well
in many scenarios, especially for smaller setups. However, you absolutely
must use a recent kernel: we encountered a lot of deadlocks, random hangs,
and reconnect failures with kernel 4.9 in larger setups under load.

The main problem with the kernel NFS server is that the whole concept of
having an NFS server in the kernel is just bad design, especially when it's
backed by Ceph (it's only in there for historical reasons). It also means
that it's hard to extend and development cycles are slow. Features related
to clustering are especially lacking in the kernel server; for example,
Ganesha 2.7 will come with a RADOS recovery backend. Also, pass-through of
CephFS delegations is a nice feature that is only possible if your NFS
server and CephFS client are tightly integrated. So the future of NFS is
definitely with Ganesha.


Paul


2018-06-30 22:22 GMT+02:00 Nick Fisk :

> Hi Paul,
>
>
>
> Thanks for your response. Is there anything you can go into more detail on
> and share with the list? I’m sure it would be much appreciated by more than
> just myself.
>
> I was planning on kernel CephFS and the kernel NFS server; both seem to
> achieve better performance, although stability is of greater concern.
>
>
>
> Thanks,
>
> Nick
>
> [remainder of quoted thread trimmed]

Re: [ceph-users] CephFS+NFS For VMWare

2018-07-02 Thread Nick Fisk


Quoting Ilya Dryomov :


On Fri, Jun 29, 2018 at 8:08 PM Nick Fisk  wrote:


[quoted text trimmed]

1. It supports fancy striping, so should mean there is less per-object
contention


Hi Nick,

Fancy striping is supported since 4.17. I think its primary use case is
small sequential I/Os, so I'm not sure if it is going to help much, but it
might be worth doing some benchmarking.


Thanks Ilya, I will try to find some time to also investigate this.
Nick



Thanks,

Ilya






Re: [ceph-users] CephFS+NFS For VMWare

2018-07-02 Thread Ilya Dryomov
On Fri, Jun 29, 2018 at 8:08 PM Nick Fisk  wrote:
> [quoted text trimmed]
> 1. It supports fancy striping, so should mean there is less per-object
> contention

Hi Nick,

Fancy striping is supported since 4.17. I think its primary use case is
small sequential I/Os, so I'm not sure if it is going to help much, but it
might be worth doing some benchmarking.
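
For anyone wanting to benchmark it, layouts can be set on an empty file (or
inherited from a directory) through the CephFS layout vxattrs. A minimal
Python sketch; the mount point and stripe geometry are illustrative
assumptions:

import os

path = "/mnt/cephfs/test.bin"   # assumed kernel CephFS mount
open(path, "w").close()         # layouts can only be set while the file is empty

# 64KB stripe unit spread over 8 objects: sequential 64KB IOs rotate
# across 8 RADOS objects instead of queueing in a single 4MB object.
os.setxattr(path, "ceph.file.layout.stripe_unit", b"65536")
os.setxattr(path, "ceph.file.layout.stripe_count", b"8")

print(os.getxattr(path, "ceph.file.layout").decode())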

Thanks,

Ilya


Re: [ceph-users] CephFS+NFS For VMWare

2018-06-29 Thread Paul Emmerich
VMWare can be quite picky about NFS servers.
Some things that you should test before deploying anything with that in
production:

* failover
* reconnects after NFS reboots or outages
* NFS3 vs NFS4
* Kernel NFS (which kernel version? cephfs-fuse or cephfs-kernel?) vs NFS
Ganesha (VFS FSAL vs. Ceph FSAL)
* Stress tests with lots of VMWare clients - we had a setup that ran fine
with 5 big VMWare hypervisors but started to get random deadlocks once we
added 5 more (a minimal load-generator sketch follows this list)
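
On the stress-test point, a minimal Python sketch of the sort of load worth
generating: many workers doing synchronous 64KB writes against the NFS
datastore, mimicking the vSphere pattern described elsewhere in this thread
(mount point, worker count and sizes are all assumptions):

import os
from concurrent.futures import ProcessPoolExecutor

MOUNT = "/mnt/nfs-datastore"   # assumed NFS mount of the CephFS export
IO_SIZE = 64 * 1024            # vSphere-style 64KB IOs
IOS_PER_WORKER = 4096          # 256MB written per worker
WORKERS = 32                   # stand-in for QD32 / many hypervisors

def writer(worker_id: int) -> None:
    # One worker streaming synchronous 64KB writes to its own file.
    buf = os.urandom(IO_SIZE)
    fd = os.open(f"{MOUNT}/stress.{worker_id}",
                 os.O_WRONLY | os.O_CREAT | os.O_SYNC)
    try:
        for _ in range(IOS_PER_WORKER):
            os.write(fd, buf)
    finally:
        os.close(fd)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(writer, range(WORKERS)))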

We are running CephFS + NFS + VMWare in production, but we encountered *a
lot* of problems before we got it stable for a few configurations.
Be prepared to debug NFS problems at a low level with tcpdump and a careful
read of the RFC and NFS server source ;)

Paul

2018-06-29 18:48 GMT+02:00 Nick Fisk :

> [quoted text trimmed; see the original message below]
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


[ceph-users] CephFS+NFS For VMWare

2018-06-29 Thread Nick Fisk
This is for us peeps using Ceph with VMWare.

 

My current favoured solution for consuming Ceph in VMWare is via RBDs
formatted with XFS and exported via NFS to ESXi. This seems to perform
better than iSCSI+VMFS, which doesn't play nicely with Ceph's PG contention
issues, particularly if working with thin-provisioned VMDKs.

 

I've still been noticing some performance issues, however, mainly noticeable
when doing any form of storage migration. This is largely due to the way
vSphere transfers VMs in 64KB IOs at a QD of 32. vSphere does this so arrays
with QoS can balance the IO more easily than if larger IOs were submitted.
However, Ceph's PG locking means that only one or two of these IOs can
happen at a time, seriously lowering throughput. Typically you won't be able
to push more than 20-25MB/s during a storage migration.
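
To put rough numbers on that, a quick Python sketch; the 1-2 effective IOs
in flight and the ~5ms per-IO latency are assumptions chosen to match the
figures above:

# Little's law: throughput = (IOs in flight / latency) * IO size
io_size = 64 * 1024    # bytes per vSphere migration IO
latency = 0.005        # assumed seconds per 64KB write via NFS+XFS+Ceph
for in_flight in (1, 2, 32):
    mb_s = in_flight / latency * io_size / 2**20
    print(f"{in_flight:>2} IOs in flight -> {mb_s:6.1f} MB/s")
# 1 -> 12.5, 2 -> 25.0: the observed 20-25MB/s ceiling.
# 32 -> 400.0: roughly what QD32 could do without PG serialisation.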

 

There is also another issue in that the IO needed for the XFS journal on the
RBD can cause contention, and it effectively means every NFS write IO sends
two IOs down to Ceph. This can have an impact on latency as well. Due to
possible PG contention caused by the XFS journal updates when multiple IOs
are in flight, you normally end up making more and more RBDs to try and
spread the load. This normally means you end up having to do storage
migrations... you can see where I'm going with this.

 

I've been thinking for a while that CephFS works around a lot of these 
limitations. 

 

1.   It supports fancy striping, so should mean there is less per-object
contention (see the sketch after this list)

2.   There is no FS in the middle to maintain a journal and other
associated IO

3.   A single large NFS mount should have none of the disadvantages seen
with a single RBD

4.   No need to migrate VMs about because of #3

5.   No need to fstrim after deleting VMs

6.   Potential to do away with Pacemaker and use LVS to do active/active
NFS, as ESXi does its own locking with files
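
On point 1, a small Python sketch of why striping should help with this IO
pattern; the default layout and the alternative geometry are illustrative
assumptions:

# Which RADOS object does each sequential 64KB IO land in?
def object_index(offset, stripe_unit, stripe_count, object_size):
    # Map a file offset to an object index under Ceph's striping rules.
    su_per_object = object_size // stripe_unit
    stripe_no = offset // stripe_unit
    object_set = stripe_no // (stripe_count * su_per_object)
    return object_set * stripe_count + stripe_no % stripe_count

KB = 1024
offsets = [i * 64 * KB for i in range(32)]   # one QD32 burst of 64KB IOs

default = {object_index(o, 4096 * KB, 1, 4096 * KB) for o in offsets}
fancy   = {object_index(o, 64 * KB, 8, 4096 * KB) for o in offsets}

print(default)   # {0}: all 32 IOs contend on one object (and its PG)
print(fancy)     # {0..7}: the same burst spreads across 8 objects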

 

With this in mind I exported a CephFS mount via NFS and then mounted it on
an ESXi host as a test.

 

Initial results are looking very good. I'm seeing storage migrations to the
NFS mount going at over 200MB/s, which equates to several thousand IOs and
seems to be writing at the intended QD32.

 

I need to do more testing to make sure everything works as intended, but like I 
say, promising initial results. 

 

Further testing needs to be done to see what sort of MDS performance is
required; I would imagine that since we are mainly dealing with large files,
it might not be that critical. I also need to consider the stability of
CephFS: RBD is relatively simple and is in use by a large proportion of the
Ceph community, while CephFS is a lot easier to "upset".

 

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com