just some thoughts and comments:


The default ceph setup uses 3 replicas on three different hosts, so you need at least three hosts for a ceph cluster. Other configurations with a smaller number of hosts are possible, but not recommended. Depending on your workload and access pattern you can also store your files on an EC (erasure-coded) pool, which can improve the usable capacity.
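To get a feeling for the capacity difference, here is a little back-of-the-envelope calculation (the 100 TB raw capacity is just an example number, not from any real cluster):

```python
# Rough usable-capacity comparison between a replicated pool (3 replicas)
# and an EC pool. The raw capacity is an arbitrary example value.

def usable_capacity(raw_tb, data_chunks, extra_chunks):
    """Usable capacity of a pool storing data_chunks data chunks plus
    extra_chunks coding chunks (EC) or extra copies (replicated)."""
    return raw_tb * data_chunks / (data_chunks + extra_chunks)

raw = 100.0  # TB of raw disk space (example)

replicated = usable_capacity(raw, 1, 2)  # 3 replicas = 1 data + 2 extra copies
ec_4_2 = usable_capacity(raw, 4, 2)      # EC pool with 4 data + 2 coding chunks

print(f"replicated (size 3): {replicated:.1f} TB usable")  # 33.3 TB
print(f"EC 4+2:              {ec_4_2:.1f} TB usable")      # 66.7 TB
```

So an EC pool can double the usable capacity for the same raw disks, at the cost of more CPU and network traffic on reads and writes.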

You also need to consider failure scenarios. With three hosts and a default replicated pool (3 replicas) or the smallest EC pool (2+1), the failure of any host will put your cluster into an undesired state, since it cannot recover without a spare active host to replicate data to. The minimum number of hosts is thus (size of pool) + (number of allowed failed hosts). To be on the safe side, you need at least 4 hosts. More hosts will improve bandwidth, IOPS and reliability.
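The rule of thumb above in code form, just to make it explicit:

```python
# Minimum host count per the rule of thumb:
# (size of pool) + (number of failed hosts the cluster should self-heal from).

def min_hosts(pool_size, tolerated_host_failures):
    return pool_size + tolerated_host_failures

# Replicated pool with size 3, surviving one failed host -> 4 hosts.
print(min_hosts(3, 1))          # 4
# An EC 2+1 pool (3 chunks total) ends up with the same minimum.
print(min_hosts(2 + 1, 1))      # 4
```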

The default ceph setup manages each disk separately using an OSD daemon. These daemons consume RAM. With 20 disks in a host, you also need a fair amount of RAM. The latest OSD implementation (bluestore) does not use the kernel page cache, so each OSD consumes RAM independently of the others. With 20 disks per host you need at least 64 GB RAM for a sane setup (according to my gut feeling; others may have better numbers). More RAM is always desirable.
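A quick sanity check of that number, assuming the bluestore default of 4 GiB per OSD (osd_memory_target) plus some OS overhead (the overhead value is my own guess):

```python
# Back-of-the-envelope RAM sizing for an OSD host.
# 4 GiB per OSD matches the bluestore osd_memory_target default;
# the OS overhead is an assumption on my part.

GIB_PER_OSD = 4   # default bluestore osd_memory_target, in GiB
OS_OVERHEAD = 8   # GiB reserved for the OS and other daemons (assumption)

def osd_host_ram_gib(num_osds):
    return num_osds * GIB_PER_OSD + OS_OVERHEAD

print(osd_host_ram_gib(20))  # 88
```

So with the defaults, 20 OSDs already want more than 64 GB under load; 64 GB is really a lower bound.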

Part of the OSD information is stored in a key-value database. It is highly advisable to put these databases on SSDs. You can share one SSD between several OSDs (e.g. by creating partitions), but keep in mind that the failure of such an SSD also renders the content of every OSD using it useless. Do not use consumer-grade SSDs. There are many discussions about SSDs on the mailing list; just search the archive.
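To illustrate the failure domain you create by sharing DB SSDs (the layout below is completely hypothetical):

```python
# One shared DB SSD failing takes down every OSD whose DB lives on it.
# This mapping is a made-up example layout, not from a real cluster.

db_ssd_to_osds = {
    "ssd0": ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"],
    "ssd1": ["osd.5", "osd.6", "osd.7", "osd.8", "osd.9"],
}

def osds_lost_on_ssd_failure(layout, failed_ssd):
    return layout.get(failed_ssd, [])

print(osds_lost_on_ssd_failure(db_ssd_to_osds, "ssd0"))  # five OSDs gone at once
```

So plan the replication/EC layout such that the cluster can survive all OSDs behind one DB SSD failing simultaneously.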

In addition to the OSD daemons, a ceph cluster also requires monitor daemons ("mon"). They manage all the meta information needed to operate the cluster (hosts running mons, hosts running osds, pool definitions, authentication information and other stuff). If you lose your mons, you will have a hard time recovering your cluster. You need an odd number of mons for a sane quorum setup, so three mons are advisable. Do not run a production cluster with one or two mons. The mons also need a small chunk of fast, reliable storage, so use enterprise-grade SSDs for this. Some people also advise against running mons colocated with osds on the same host; we run them colocated, on a software raid 1 over two NVMe devices (which are also used for our OSDs). Depending on your available hardware you might want to put at least one mon on a small dedicated host.
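Why an odd number? The mon quorum needs a strict majority, so an even mon count tolerates no more failures than the next-lower odd count:

```python
# Mon quorum math: a quorum is a strict majority of the mons,
# so the number of tolerated mon failures only grows at odd counts.

def tolerated_mon_failures(num_mons):
    quorum = num_mons // 2 + 1      # strict majority
    return num_mons - quorum

for n in (1, 2, 3, 4, 5):
    print(n, "mons ->", tolerated_mon_failures(n), "failed mons tolerated")
# 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2
```

Two mons are actually worse than one (two ways to lose quorum instead of one), and four buy you nothing over three.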

If you want to use CephFS, you also need at least one metadata server ("mds"). For high availability a second standby instance is recommended. For large filesystems with a large number of files (or a large number of metadata operations per second), you can also run several active mds servers, as long as you keep at least one standby instance. The mds servers consume a large amount of RAM, depending on their configuration and the number of files open simultaneously. Our mds servers use about 7 GB RAM with ~2 million cached inodes and ~1 million capabilities (ownership/locking information). Again, it might not be advisable to run mds servers colocated with osds.
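From those numbers you can derive a rough per-inode cost for sizing your own mds hosts (this is only our cluster's ratio, yours may differ):

```python
# Rough per-inode memory cost, derived from our observed numbers above:
# ~7 GB of mds RAM for ~2 million cached inodes.

mds_ram_gb = 7
cached_inodes = 2_000_000

kib_per_inode = mds_ram_gb * 1024**2 / cached_inodes
print(f"~{kib_per_inode:.1f} KiB per cached inode")  # ~3.7 KiB
```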

A low-latency network is highly recommended between the ceph cluster hosts and the ceph clients. Use 10 GBit/s or better; 1 GBit/s works, but the performance will be bad. Also consider redundancy (e.g. LACP bonding of links). The ceph cluster can use a public network for client access and an internal network for the replication between osd hosts. Unless you have special requirements, I would recommend not to use a separate internal network due to the higher configuration effort. All ceph hosts must be able to contact each other, and all ceph clients must be able to contact the ceph hosts. As an example, reading a file from a CephFS filesystem requires contacting a mon to retrieve the current mds, osd and pool information (once during mount), contacting the mds to retrieve the metadata for the file, and finally getting the data from the osd hosts.

Comparison to NFS:

There are a number of important differences to a standard NFS setup. CephFS uses POSIX semantics, so file locks are enforced. Every file access first results in a round trip to the mds to acquire the "capabilities" for accessing the file. If the file is currently in use by another client, the mds might contact that client and ask it to release its capabilities (e.g. after the file has been closed, but is still present in the page cache). Applications relying on the less stringent NFS semantics might suffer a severe performance impact.

CephFS also does not perform any mds-side authorization based on unix permissions (AFAIK). Access to the mds (and thus the filesystem) is controlled by a shared ceph user secret. You do not have the ability to use kerberos or server-side unix group permission checks. And you need to trust the clients, since the ceph user secret has to be stored on them.

You can export a CephFS filesystem via NFS, either by re-exporting a mountpoint or by using NFS-Ganesha with its native CephFS support. But the NFS server will become a bottleneck and a single point of failure in this case. Mounting CephFS directly on the clients is the recommended setup.


As mentioned above, you need at least three hosts in an initial setup and probably some hardware upgrades (RAM, SSDs). Do not be misled by the possibility to set up a single-host cluster; I wouldn't consider such a setup even temporarily for migration purposes. It's an invitation for Murphy to strike...

If you cannot free three hosts completely, you can also run both setups side by side. Start with a small number of disks you can remove from the raid setups on each host, convert them to ceph osds, migrate data between the filesystems, and proceed with more disks until all data is migrated. One important aspect you need to consider is the fact that you cannot change the number of coding/data chunks in an EC pool. If you want to use an EC pool for the filesystem, you need to create it with the number of coding/data chunks you want to have in the final setup.


A file is automatically split into chunks of up to 4 MB size; each chunk is mapped to a placement group in a cephfs data pool; a placement group is mapped to a configurable number of OSDs (e.g. three OSDs on three different hosts for a default replicated pool). One instance of the placement group is the primary one; all IO operations are sent to the OSD holding this instance. In case of a write operation, that OSD passes the operation on to the other OSDs involved; in case of a read operation, it either reads the data from disk and sends it back to the client (replicated pool), or collects all data chunks from the other OSDs, merges them and sends the data back to the client (EC pool). In the end, you read the data from a single disk, so you can expect the performance of a single disk, which is usually worse than that of a full raid array. Ceph is not tuned for fast single IO operations, but it scales well with the number of ceph cluster hosts and ceph clients. Whether it is fast enough depends on your workload and access pattern. Large files may also benefit from a large readahead setting, resulting in parallel access to multiple OSD hosts.
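The striping described above can be sketched like this (the file size is just an example, and this only counts the chunks; the actual PG/OSD mapping is done by CRUSH inside ceph):

```python
# Sketch of the default CephFS striping: a file is cut into objects of up
# to 4 MiB, and each object is placed independently on the OSDs.

import math

OBJECT_SIZE = 4 * 1024 * 1024   # default object size: 4 MiB

def object_count(file_size_bytes):
    """Number of RADOS objects a file of the given size is split into."""
    return max(1, math.ceil(file_size_bytes / OBJECT_SIZE))

size = 100 * 1024 * 1024        # an example 100 MiB file
print(object_count(size), "objects")  # 25 objects
```

With a large readahead, those 25 objects can be fetched from many OSD hosts in parallel, which is where ceph's aggregate bandwidth comes from.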


This is what ceph is designed for. A sane ceph setup (replicated pool with three replicas, HA mons, HA mds) is almost indestructible. But you need good monitoring, and you have to follow certain rules, like always keeping enough free capacity for failure recovery.


If you have enough hosts and the correct hardware configuration, and you are able to convert either complete hosts or individual disks to ceph, you should be able to migrate. Whether it is worth the effort depends on your workload and IO requirements. I would highly recommend setting up a test cluster first to get used to ceph configuration and operations.



ceph-users mailing list