Re: [ceph-users] Applicability and migration path

Burkhard Linke Fri, 10 Aug 2018 04:09:01 -0700

Hi,


just some thoughts and comments:



Hardware:

The default ceph setup uses 3 replicas on three different hosts, so you need at least three hosts for a ceph cluster. Other configurations with a smaller number of hosts are possible, but not recommended. Depending on the workload and access pattern you can also store your files on an EC (erasure coded) pool, which might improve the available capacity.
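
To put rough numbers on the capacity difference, here is a minimal sketch. It only compares the ratio of user data to raw data written, and ignores metadata overhead and the free space you need to keep for recovery. The function names are made up for illustration and are not part of any ceph tool.

```python
def usable_fraction_replicated(size: int) -> float:
    """Fraction of raw capacity holding user data in a replicated pool.

    size is the number of replicas (3 in the default setup).
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return 1.0 / size


def usable_fraction_ec(k: int, m: int) -> float:
    """Fraction of raw capacity holding user data in an EC pool.

    k is the number of data chunks, m the number of coding chunks.
    """
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return k / (k + m)


if __name__ == "__main__":
    raw_tb = 100.0  # hypothetical raw capacity, for illustration only
    print("3 replicas:", round(raw_tb * usable_fraction_replicated(3), 1), "TB")
    print("EC 2+1:    ", round(raw_tb * usable_fraction_ec(2, 1), 1), "TB")
```

So with the same raw disks, the smallest EC pool (2+1) stores about twice as much user data as a pool with 3 replicas, at the price of more work on reads and writes.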

You also need to consider failure scenarios. With three hosts and a default replicated pool (3 replicas) or the smallest EC pool (2+1), the failure of any host will put your cluster into an undesired state, since it cannot recover without a third active host to replicate data to. The minimum number of hosts is thus (size of pools) + (number of allowed failed hosts). To be on a somewhat safe side, you need at least 4 hosts. More hosts will improve bandwidth, IOPS and reliability.
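
The host count rule above, written out as a small sketch. It assumes one replica or chunk per host, so the "size" of an EC pool is k + m. The function names are hypothetical.

```python
def ec_pool_size(k: int, m: int) -> int:
    """Number of chunks (and thus hosts) an EC pool with k+m needs."""
    return k + m


def min_hosts(pool_size: int, allowed_failed_hosts: int) -> int:
    """Minimum hosts = (size of pool) + (number of allowed failed hosts)."""
    if pool_size < 1 or allowed_failed_hosts < 0:
        raise ValueError("pool_size must be >= 1, allowed_failed_hosts >= 0")
    return pool_size + allowed_failed_hosts


if __name__ == "__main__":
    # Default replicated pool, able to recover after one host failure.
    print(min_hosts(3, 1))
    # Smallest EC pool (2+1), same failure tolerance.
    print(min_hosts(ec_pool_size(2, 1), 1))
```

With several pools, take the largest pool size.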

The default ceph setup manages each disk separately using an OSD daemon. These daemons will consume RAM. With 20 disks in a host, you also need a fair amount of RAM. The latest OSD implementation (bluestore) does not use the kernel page cache, so each OSD will consume RAM independently of the others. With 20 disks per host you need at least 64 GB RAM for a sane setup (according to my gut, others may have better numbers). More RAM is always desirable.

Part of the OSD information is stored in a key-value database. It is highly advisable to put these databases on SSDs. You can share one SSD between several OSDs (e.g. by creating partitions), but keep in mind that the failure of one of these SSDs also renders the content of every OSD using it useless. Do not use consumer grade SSDs. There are many discussions on the mailing list about SSDs, just search the archive.

In addition to the OSD daemons a ceph cluster also requires monitoring daemons ("mon"). They manage all metainformation needed to operate the cluster (hosts running mons, hosts running osds, pool definitions, authentication information and other stuff). If you lose your mons, you will have a hard time recovering your cluster. You need an odd number of mons for a sane quorum setup, so three mons are advisable. Do not run a production cluster with one or two mons. The mons also need a small chunk of fast, reliable storage, so use enterprise grade SSDs for this. Some people also advise not to run mons colocated with osds on the same host; we run them colocated with a software raid 1 over two NVMe devices (that are also used for our OSDs). Depending on your available hardware you might want to put at least one mon on a small dedicated host.
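
Why an odd number of mons? The mons need a strict majority to form a quorum, so an even number adds a host without adding any failure tolerance. A minimal sketch of the arithmetic (the function names are mine, not ceph's):

```python
def quorum_size(num_mons: int) -> int:
    """Smallest strict majority of num_mons monitors."""
    if num_mons < 1:
        raise ValueError("need at least one mon")
    return num_mons // 2 + 1


def tolerated_mon_failures(num_mons: int) -> int:
    """How many mons may fail while a majority is still reachable."""
    return num_mons - quorum_size(num_mons)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, "mons: quorum", quorum_size(n),
              "tolerated failures", tolerated_mon_failures(n))
```

One or two mons survive no failure at all, three and four mons both survive one, five survive two.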

If you want to use CephFS, you also need at least one metadata server ("mds"). For high availability a second standby instance is recommended. For large filesystems with a large number of files (or a large number of meta operations per second), you can also have several active mds servers, as long as you have at least one standby instance. The mds servers consume a large amount of RAM depending on their configuration and the number of files open simultaneously. Our mds servers use about 7 GB RAM with ~ 2 million cached inodes and ~ 1 million capabilities (ownership/locking information). Again it might not be advisable to run mds servers colocated with osds.

A low latency network is highly recommended between the ceph cluster hosts and the ceph clients. Use 10 GBit or better; 1 GBit works but the performance will be bad. Also consider redundancy (e.g. LACP bonding of links). The ceph cluster can use a public network for client access and an internal network for the replication between osd hosts. Unless you have special requirements I would recommend not using a separate internal network due to the higher configuration effort. All ceph hosts must be able to contact each other, and all ceph clients must be able to contact the ceph hosts. As an example, reading a file from a CephFS filesystem requires contacting a mon to retrieve the current mds, osd and pool information (once during mount), contacting the mds to retrieve the metadata for the file, and finally getting the data from the osd hosts.



Comparison to NFS:

There are a number of important differences to a standard NFS setup. CephFS uses POSIX semantics, so file locks are enforced. Every file access results in a roundtrip to the mds first to acquire the "capabilities" to access the file. If the file is currently in use by another client, the mds might contact that client and ask it to release its capabilities (e.g. after the file was closed, but is still present in the page cache). Applications relying on the less stringent NFS semantics might suffer a severe performance impact.

CephFS also does not perform any mds side authorization based on unix permissions (AFAIK). The access to the mds (and thus the filesystem) is controlled by a shared ceph user secret. You do not have the ability to use kerberos or server side unix group permission checks. And you need to trust the clients since you need to store the ceph user secret on the client.

You can export a CephFS filesystem via NFS either by re-exporting a mountpoint or using ganesha NFS with native cephfs support. But the NFS server will become the bottleneck and single point of failure in this case. CephFS on the clients is the recommended setup.



Migration:

As mentioned above, you need at least three hosts in an initial setup and probably some hardware upgrade (RAM, SSDs). Do not be misled by the possibility to set up a single host cluster; I wouldn't consider such a setup even temporarily for migration purposes. It's an invitation for Murphy to strike...

If you cannot free three hosts completely, you can also run both setups side by side. Start with a small number of disks you can remove from the raid setups on each host, convert them to ceph osds, migrate data between the filesystems, and proceed with more disks until all data is migrated. One important aspect you need to consider is the fact that you cannot change the number of coding/data chunks in an EC pool. If you want to use an EC pool for the filesystem, you need to create it with the number of coding/data chunks you want to have in the final setup.



Speed:

A file is automatically split up into chunks of up to 4 MB size; each chunk is mapped to a placement group in a cephfs data pool; a placement group is mapped to a configurable number of OSDs (e.g. three OSDs on three different hosts for a default replicated pool). One instance of the placement group is the primary one; all IO operations are sent to the OSD having this instance. In case of a write operation, the OSD will pass the operation to the other OSDs involved; in case of a read operation it either reads the data from disk and sends it back to the client (replicated pool), or collects all data chunks from the other OSDs, merges them and sends the data back to the client (EC pool). In the end, you read the data from a single disk. You can expect the performance of a single disk, which is usually worse than a full raid array. Ceph is not tuned for fast single IO operations, but it scales well with the number of ceph cluster hosts and ceph clients. Whether it is fast enough depends on your workload and access pattern. Large files may also benefit from a large readahead setting, resulting in parallel access to multiple OSD hosts.
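
To illustrate why a large read can touch several OSDs while a small one hits a single disk, here is a rough sketch of how a byte range falls onto the 4 MB chunks. It assumes the simple case where a file is cut into consecutive fixed-size chunks, and it does not model the placement group or OSD mapping at all. All names are hypothetical.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB chunks, as described above


def chunk_index(offset: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Index of the chunk that holds the byte at the given file offset."""
    if offset < 0:
        raise ValueError("offset must not be negative")
    return offset // chunk_size


def chunks_for_read(offset: int, length: int, chunk_size: int = CHUNK_SIZE):
    """Split a read into (chunk index, offset in chunk, byte count) pieces."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must not be negative")
    pieces = []
    end = offset + length
    while offset < end:
        within = offset % chunk_size
        count = min(chunk_size - within, end - offset)
        pieces.append((offset // chunk_size, within, count))
        offset += count
    return pieces


if __name__ == "__main__":
    # A 64 KB read stays inside one chunk, i.e. one primary OSD.
    print(chunks_for_read(0, 64 * 1024))
    # A 10 MB read spans three chunks, which may live on different OSDs.
    print(chunks_for_read(0, 10 * 1024 * 1024))
```

Each piece belongs to a different chunk, and each chunk can end up in a different placement group. That is where a large readahead can buy you parallelism.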



Reliability:

This is what ceph is designed for. A sane ceph setup (replicated pool with three replicas, HA mon, HA mds) is almost indestructible. But you need good monitoring and must follow certain rules, like always keeping enough free capacity for failure recovery.



TL;DR:

If you have enough hosts and the correct hardware configuration for the hosts and you are able to either convert complete hosts or individual disks to ceph, you should be able to migrate. Whether it is worth the effort depends on your workload and IO requirements. I would highly recommend setting up a test cluster beforehand to get used to ceph configuration and operations.



Regards,

Burkhard


