just some thoughts and comments:
The default ceph setup uses three replicas on three different hosts, so
you need at least three hosts for a ceph cluster. Other configurations
with a smaller number of hosts are possible, but not recommended.
Depending on the workload and access pattern you can also store your
files in an EC (erasure-coded) pool, which might improve the usable capacity.
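To illustrate the capacity difference (the 4+2 layout below is just an example, not a recommendation): a replicated pool keeps "size" full copies, while an EC pool with k data and m coding chunks only adds m/k overhead.

```shell
# Usable fraction of raw capacity, in percent (integer arithmetic):
# replicated pool: 1/size, EC pool: k/(k+m).
echo $((100 * 1 / 3))   # 3-way replication: 33
echo $((100 * 4 / 6))   # EC 4+2: 66
```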
You also need to consider failure scenarios. With three hosts and a
default replicated pool (three replicas) or the smallest EC pool (2+1),
the failure of any host will put your cluster into an undesired state,
since it is not able to recover without a third active host to replicate
data to. The minimum number of hosts is thus (size of pool) + (number
of allowed failed hosts). To be on the somewhat safe side, you need at
least four hosts. More hosts will improve bandwidth, IOPS and reliability.
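The rule of thumb above can be written down as a tiny helper (a sketch; the numbers are the ones from the examples in the text):

```shell
# Minimum hosts = (pool size) + (number of allowed failed hosts).
# For a replicated pool the size is the replica count; for an EC pool it is k+m.
min_hosts() {
    echo $(($1 + $2))
}
min_hosts 3 1           # replicated, size 3, tolerate 1 failed host -> 4
min_hosts $((2 + 1)) 1  # EC 2+1, tolerate 1 failed host -> 4
```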
The default ceph setup manages each disk separately using an OSD
daemon. These daemons consume RAM. With 20 disks in a host, you
also need a fair amount of RAM. The latest OSD implementation
(BlueStore) does not use the kernel page cache, so each OSD consumes
RAM independently of the others. With 20 disks per host you need at
least 64 GB RAM for a sane setup (according to my gut feeling; others
may have better numbers). More RAM is always desirable.
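For reference, BlueStore's per-OSD memory consumption is driven by the osd_memory_target option, which defaults to roughly 4 GiB, so 20 OSDs can already claim about 80 GB on their own. A sketch of lowering it to fit a 64 GB host; the 3 GiB value is my assumption, not a tested recommendation:

```shell
# Cap the BlueStore memory budget per OSD at 3 GiB (value in bytes).
# Requires a running cluster; leaves headroom for the OS and other daemons.
ceph config set osd osd_memory_target 3221225472
```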
Part of the OSD information is stored in a key-value database. It is
highly advisable to put these databases on SSDs. You can share one SSD
between several OSDs (e.g. by creating partitions), but keep in mind that
the failure of such an SSD also renders the content of every OSD using it
useless. Do not use consumer grade SSDs. There are many discussions on the
mailing list about suitable SSDs; just search the archive.
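Placing the DB on a shared SSD is done at OSD creation time; a sketch using ceph-volume (the device names are placeholders for your actual disks):

```shell
# Create an OSD with its data on an HDD and its RocksDB on an NVMe partition.
# One partition per OSD; remember that losing the NVMe kills all OSDs using it.
ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1
```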
In addition to the OSD daemons, a ceph cluster also requires monitor
daemons ("mon"). They manage all the meta information needed to operate the
cluster (hosts running mons, hosts running OSDs, pool definitions,
authentication information and so on). If you lose your mons, you
will have a hard time recovering your cluster. You need an odd number of
mons for a sane quorum setup, so three mons are advisable. Do not run a
production cluster with one or two mons. The mons also need a small
chunk of fast, reliable storage, so use enterprise grade SSDs for this.
Some people also advise not to run mons colocated with OSDs on the same
host; we run them colocated, on a software RAID 1 over two NVMe devices
(which are also used for our OSDs). Depending on your available hardware
you might want to put at least one mon on a small dedicated host.
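On a cephadm-managed cluster (an assumption; older clusters use manual or ceph-deploy setups), getting to three mons and checking quorum might look like:

```shell
ceph orch apply mon 3   # let the orchestrator place three mons
ceph quorum_status      # verify that all mons have joined the quorum
```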
If you want to use CephFS, you also need at least one metadata server
("mds"). For high availability a second standby instance is recommended.
For large filesystems with a large number of files (or a large number of
metadata operations per second), you can also run several active mds
servers, as long as you have at least one standby instance. The mds
servers consume a large amount of RAM, depending on their configuration
and the number of files open simultaneously. Our mds servers use about 7
GB RAM with ~2 million cached inodes and ~1 million capabilities
(ownership/locking information). Again, it might not be advisable to run
mds servers colocated with OSDs.
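A sketch of scaling the mds side (the filesystem name "myfs" and the 8 GiB cache limit are assumptions for illustration):

```shell
# Two active mds daemons; remaining daemons become standbys automatically.
ceph fs set myfs max_mds 2
# The cache limit largely determines mds RAM usage (value in bytes).
ceph config set mds mds_cache_memory_limit 8589934592   # 8 GiB
```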
A low latency network is highly recommended between the ceph cluster hosts
and the ceph clients. Use 10 GBit or better; 1 GBit works, but the
performance will be bad. Also consider redundancy (e.g. LACP bonding of
links). A ceph cluster can use a public network for client access and
an internal network for the replication between OSD hosts. Unless you
have special requirements, I would recommend not using a separate
internal network due to the higher configuration effort. All ceph hosts
must be able to contact each other, and all ceph clients must be able to
contact the ceph hosts. As an example, reading a file from a CephFS
filesystem requires contacting a mon to retrieve the current mds, OSD
and pool information (once during mount), contacting the mds to retrieve
the metadata for the file, and finally getting the data from the OSD hosts.
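If you do want the split, the two networks are configured via the public_network and cluster_network options (the CIDRs below are placeholders):

```shell
ceph config set global public_network 192.0.2.0/24      # client-facing network
ceph config set global cluster_network 198.51.100.0/24  # OSD replication network
```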
Comparison to NFS:
There are a number of important differences to a standard NFS setup.
CephFS uses POSIX semantics, so file locks are enforced. Every file
access results in a roundtrip to the mds first to acquire the
"capabilities" to access the file. If the file is currently in use by
another client, the mds might contact that client and ask it to release
its capabilities (e.g. after the file was closed, but is still present
in the page cache). Applications relying on the less stringent NFS
semantics might see a severe performance impact.
CephFS also does not perform any mds-side authorization based on unix
permissions (AFAIK). Access to the mds (and thus the filesystem) is
controlled by a shared ceph user secret. You do not have the ability to
use kerberos or server-side unix group permission checks. And you need
to trust the clients, since you need to store the ceph user secret on the
client machines.
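Client access is granted via cephx capabilities; a sketch (filesystem name, client name and path are placeholders):

```shell
# Allow client.backup to read and write the /backup subtree of filesystem "myfs";
# the command prints the generated secret, which has to be stored on the client.
ceph fs authorize myfs client.backup /backup rw
```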
You can export a CephFS filesystem via NFS, either by re-exporting a
mountpoint or by using NFS-Ganesha with native CephFS support. But the NFS
server will become the bottleneck and single point of failure in this
case. Mounting CephFS directly on the clients is the recommended setup.
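For completeness, a minimal NFS-Ganesha export block using the native CephFS FSAL might look like this (the filesystem name and pseudo path are assumptions):

```ini
EXPORT {
    Export_Id = 1;
    Path = "/";             # path inside the CephFS filesystem
    Pseudo = "/cephfs";     # path seen by NFS clients
    Access_Type = RW;
    FSAL {
        Name = CEPH;        # use the native CephFS FSAL
        Filesystem = "myfs";
    }
}
```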
As mentioned above, you need at least three hosts in an initial setup
and probably some hardware upgrades (RAM, SSDs). Do not be misled by the
possibility of setting up a single host cluster; I wouldn't consider such a
setup even temporarily for migration purposes. It's an invitation for
Murphy to strike...
If you cannot free three hosts completely, you can also run both setups
side by side. Start with a small number of disks you can remove from the
raid setups on each host, convert them to ceph OSDs, migrate data
between the filesystems, and proceed with more disks until all data is
migrated. One important aspect you need to consider is the fact that you
cannot change the number of coding/data chunks in an EC pool. If you want
to use an EC pool for the filesystem, you need to create it with the
number of coding/data chunks you want to have in the final setup.
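Creating such a pool up front might look like this (the profile name, pool name, filesystem name and the 4+2 layout are assumptions):

```shell
# Define the EC layout; failure domain "host" spreads chunks across hosts.
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create cephfs_data_ec erasure ec42
# CephFS needs partial overwrites enabled on EC pools.
ceph osd pool set cephfs_data_ec allow_ec_overwrites true
ceph fs add_data_pool myfs cephfs_data_ec
```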
A file is automatically split up into chunks of up to 4 MB size; each
chunk is mapped to a placement group in a cephfs data pool; a placement
group is mapped to a configurable number of OSDs (e.g. three OSDs on
three different hosts for a default replicated pool). One instance of
the placement group is the primary one; all IO operations are sent to
the OSD holding this instance. In case of a write operation, the OSD will
pass the operation on to the other OSDs involved; in case of a read
operation it either reads the data from disk and sends it back to the
client (replicated pool), or collects all data chunks from the other
OSDs, merges them, and sends the data back to the client (EC pool). In
the end, you read the data from a single disk. You can expect the
performance of a single disk, which is usually worse than a full raid
array. Ceph is not tuned for fast single IO operations, but it scales
well with the number of ceph cluster hosts and ceph clients. Whether it
is fast enough depends on your workload and access pattern. Large files
may also benefit from a large readahead setting, resulting in parallel
access to multiple OSD hosts.
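With the kernel client, the readahead window is set via the rasize mount option (the mon address, client name, secret file and size are placeholders):

```shell
# Mount CephFS with a 32 MiB readahead window (rasize is in bytes),
# so large sequential reads hit several OSD hosts in parallel.
mount -t ceph 192.0.2.10:6789:/ /mnt/cephfs \
    -o name=client.foo,secretfile=/etc/ceph/client.foo.secret,rasize=33554432
```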
This is what ceph is designed for. A sane ceph setup (replicated pool
with three replicas, HA mons, HA mds) is almost indestructible. But you
need good monitoring, and you have to follow certain rules, like always
keeping enough free capacity for failure recovery.
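The commands to keep an eye on are the usual suspects; the nearfull ratio below is an example value, not a recommendation:

```shell
ceph status   # overall health, mon quorum, OSD and PG states
ceph df       # raw and per-pool capacity usage
# Warn earlier than the default so there is room left for recovery:
ceph osd set-nearfull-ratio 0.75
```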
If you have enough hosts and the correct hardware configuration for the
hosts, and you are able to convert either complete hosts or individual
disks to ceph, you should be able to migrate. Whether it is worth the
effort depends on your workload and IO requirements. I would highly
recommend setting up a test cluster first to get used to ceph
configuration and operations.
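A throwaway test cluster is easy to bootstrap with cephadm (the IP is a placeholder; a clean host with a container runtime is assumed):

```shell
cephadm bootstrap --mon-ip 192.0.2.10   # first mon + mgr on this host
ceph orch device ls                     # list disks usable as OSDs
```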
ceph-users mailing list