Wow.. thanks for such a detailed reply!

On 10 August 2018 at 07:08, Burkhard Linke <
[email protected]> wrote:

> The default ceph setup uses three replicas on three different hosts, so you
> need at least three hosts for a ceph cluster. Other configurations with a
> smaller number of hosts are possible, but not recommended. Depending on the
> workload and access pattern you can also store your files on an EC pool,
> which might improve the available capacity.
>

We're starting with five large file servers, so the running cluster, once
converted, should be fine.  However, it sounds like I might have problems
with the migration path.  More below.


> The default ceph setup is managing each disk separately using an OSD
> daemon. These daemons will consume RAM. With 20 disks in a host, you also
> need a fair amount of RAM. The latest OSD implementation (bluestore) does
> not use the kernel page cache, so each OSD will consume RAM independently
> from the others. With 20 disks per host you need at least 64 GB RAM for a
> sane setup (according to my gut, others may have better numbers). More RAM
> is always desirable.
>

I think we're fine here.  ZFS is a large memory consumer as well, and we're
already set up for that.
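As a sanity check on the 64 GB figure, here's a rough per-host estimate. The per-OSD number is a commonly quoted rule of thumb rather than an official figure, and BlueStore's cache target is tunable per release:

```python
def host_ram_gb(n_osds, gb_per_osd=3.0, os_headroom_gb=8.0):
    # gb_per_osd is a rule-of-thumb steady-state figure;
    # recovery and backfill can spike memory usage well above it
    return n_osds * gb_per_osd + os_headroom_gb

print(host_ram_gb(20))  # 20 OSDs -> 68.0 GB, same ballpark as the 64 GB above
```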

> Part of the OSD information is stored in a key-value database. It is highly
> advisable to put these databases on SSDs. You can share one SSD for several
> OSDs (e.g. by creating partitions), but keep in mind that the failure of
> such an SSD also renders the content of every OSD using it useless. Do not
> use consumer grade SSDs. There are many discussions on the mailing list
> about SSDs, just search the archive.
>

You're referring to the journal here?  Yes, I'd read the Hardware
Recommendations document that suggests that.  It doesn't seem to say that
partitioning the SSD is necessary, though, only that it's possible if
desired.  I haven't (yet) found any recommendations on sizing an SSD, and I
wonder if I can take that to mean that the journal is so small that size is
rarely a concern.
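For what it's worth, a sizing rule that shows up in list discussions is to give the DB/journal device a small percentage of the OSD's data capacity. The exact percentage below is my own placeholder, not an official recommendation:

```python
def db_partition_gb(osd_tb, pct=2.0):
    # hypothetical rule of thumb: DB/journal device sized at a few
    # percent of the OSD's data capacity
    return osd_tb * 1000.0 * pct / 100.0

# e.g. one 8 TB OSD -> 160 GB; five such OSDs sharing one SSD -> 800 GB total
print(db_partition_gb(8), 5 * db_partition_gb(8))
```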


> In addition to the OSD daemons a ceph cluster also requires monitoring
> daemons ("mon"). They manage all metainformation needed to operate the
> cluster (hosts running mons, hosts running osds, pool definitions,
> authentication information and other stuff). If you lose your mons, you
> will have a hard time recovering your cluster. You need an odd number of
> mons for a sane quorum setup, so three mons are advisable. Do not run a
> production cluster with one or two mons. The mons also need a small chunk
> of fast, reliable storage, so use enterprise grade SSDs for this. Some
> people also advise not to run mons colocated with osds on the same host; we
> run them colocated with a software raid 1 over two NVME devices (that are
> also used for our OSDs). Depending on your available hardware you might
> want to put at least one mon on a small dedicated host.
>

Yes, this is one area where the deployment plan is to start by colocating
the mons and improve the situation over time, moving to dedicated mons as
we do hardware refreshes.  The mons would likely share the SSDs with the
OSD journal at first.
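The "odd number of mons" advice falls out of simple majority arithmetic; an even count adds a failure domain without adding fault tolerance:

```python
def tolerated_mon_failures(n_mons):
    # quorum requires a strict majority: n // 2 + 1 mons must be alive
    return n_mons - (n_mons // 2 + 1)

for n in range(1, 6):
    print(f"{n} mons -> survives {tolerated_mon_failures(n)} failure(s)")
# 3 and 4 mons both survive only one failure; it takes 5 to survive two
```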


>
> If you want to use CephFS, you also need at least one metadata server
> ("mds"). For high availability a second standby instance is recommended.
> For large filesystems with a large number of files (or a large number of
> meta operations per second), you can also have several active mds servers,
> as long as you have at least one standby instance. The mds servers consume
> a large amount of RAM depending on their configuration and the number of
> files open simultaneously. Our mds servers use about 7 GB RAM with ~ 2
> million cached inodes and ~ 1 million capabilities (ownership/locking
> information). Again, it might not be advisable to run mds servers colocated
> with osds.
>

Indeed, especially in our initial configuration the memory requirements of
the OSDs are so high that I can't imagine we'd be able to run the MDSs on
the same hosts.
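Extrapolating from the numbers you quoted (7 GB for ~2 million cached inodes) gives a rough per-inode cost that could be used for capacity planning. Treat it as an order-of-magnitude guess, since cache behavior depends on workload and configuration:

```python
QUOTED_RAM_GB = 7.0          # figures from the quoted message above
QUOTED_INODES = 2_000_000

bytes_per_inode = QUOTED_RAM_GB * 2**30 / QUOTED_INODES  # ~3.7 KB per cached inode

def mds_ram_gb(cached_inodes):
    return cached_inodes * bytes_per_inode / 2**30

print(f"{bytes_per_inode:.0f} B/inode; 10M inodes -> ~{mds_ram_gb(10_000_000):.0f} GB")
```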


>
> Low latency network is highly recommended between ceph cluster hosts and
> the ceph clients. Use 10 GBit or better; 1 GBit works but the performance
> will be bad. Also consider redundancy (e.g. LACP bonding of links). The
> ceph cluster can use a public network for client access and an internal
> network for the replication between osd hosts. Unless you have special
> requirements I would recommend not using a separate internal network, due
> to the higher configuration effort. All ceph hosts must be able to contact
> each other, and all ceph clients must be able to contact the ceph hosts. As
> an example, reading a file from a CephFS filesystem requires contacting a
> mon to retrieve the current mds, osd and pool information (once during
> mount), contacting the mds to retrieve the metadata for the file, and
> finally getting the data from the osd hosts.
>

The file servers and their client machines are on a shared 10G network
already, so I think we're good there.
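The read path described above is also a handy way to see why latency matters more than raw bandwidth for metadata-heavy work: each step is a network round trip. The millisecond figures below are illustrative guesses, not measurements:

```python
# illustrative round-trip budget for one cold CephFS file read
round_trips_ms = {
    "mon: fetch cluster maps (once, at mount time)": 0.2,
    "mds: look up metadata, acquire capabilities": 0.3,
    "osd: read the object data": 0.5,
}
total = sum(round_trips_ms.values())
print(f"~{total:.1f} ms before the first byte, excluding disk time")
```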


> CephFS also does not perform any mds side authorization based on unix
> permissions (AFAIK). The access to the mds (and thus the filesystem) is
> controlled by a shared ceph user secret. You do not have the ability to use
> kerberos or server side unix group permissions checks. And you need to
> trust the clients since you need to store the ceph user secret on the
> client.
>

I think we're probably okay here.  Reads and writes are split up into
separate machines, and the ones that read have NFS mounts set read-only.
We don't currently have a requirement to allow some users on a read host to
be prevented from reading some data.  But, let me speculate for a moment on
some hypothetical future where we've got different access requirements for
different data sets.  If we can't restrict access to files by unix
user/group permissions, would it make sense to run multiple clusters on the
same hardware, having some OSDs on a host participate in one cluster, while
other OSDs participate in a second, and share them out as separate CephFS
mounts?  That way, access could be controlled above the mount point in the
filesystem.


>
> As mentioned above, you need at least three hosts in an initial setup and
> probably some hardware upgrade (RAM, SSDs). Do not be misled by the
> possibility to set up a single host cluster; I won't consider such a setup
> even temporarily for migration purposes. It's an invitation for Murphy to
> strike...
>
> If you cannot free three hosts completely, you can also run both setups
> side by side. Start with a small number of disks you can remove from the
> raid setups on each host, convert them to ceph osds, migrate data between
> the filesystems, and proceed with more disks until all data is migrated.
> One important aspect you need to consider is the fact that you cannot
> change the number of coding/data chunks in an EC pool. If you want to use
> an EC pool for the filesystem, you need to create it with the number of
> coding/data chunks you want to have in the final setup.
>

This sounds like a major roadblock... or at least a delay.  Shrinking
volumes in ZFS is not (yet?) possible except by destroying the filesystem
and rebuilding.  I'm pretty sure the MD/LVM2 RAID has the same limitation.
So, in order to run a migration side-by-side I need to be able to empty out
an incredibly large volume of data anyway. I haven't worked through the
whole procedure yet, but my gut feeling is that I'd still need to empty out
the equivalent of two file servers to make it work.  I can get one empty
during our regular annual disk refresh, but two would require eliminating
cross-chassis file duplication "temporarily" while data is moved around.
Considering that it takes weeks to rebalance that much data, this wouldn't
be a popular idea.

I can picture being able to work around this if we replace any single file
server with a set of smaller file servers, so perhaps we can make it work
in a future hardware refresh.
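To sanity-check that gut feeling, here's a toy model of the side-by-side migration: free some disks, convert them to OSDs, move data into ceph, then reclaim the disks that frees on the old system, and repeat. All numbers (disk counts, sizes, overhead factors) are hypothetical placeholders:

```python
import math

def migration_rounds(total_disks, disk_tb, data_tb,
                     old_overhead=2.0, ec_overhead=1.5):
    """Toy model: repeatedly free disks from the old RAID setup, add them as
    OSDs, and migrate data. Returns rounds needed, or None if stuck."""
    old_raw = data_tb * old_overhead          # raw TB still held by the old setup
    free = total_disks - math.ceil(old_raw / disk_tb)
    ceph_disks, migrated, rounds = 0, 0.0, 0
    while migrated < data_tb:
        if free <= 0:
            return None                       # no empty disks left to convert
        ceph_disks += free
        usable = ceph_disks * disk_tb / ec_overhead
        moved_total = min(data_tb, usable)
        if moved_total <= migrated:
            return None                       # no progress this round
        old_raw -= (moved_total - migrated) * old_overhead
        migrated = moved_total
        rounds += 1
        free = (total_disks - ceph_disks) - math.ceil(max(old_raw, 0.0) / disk_tb)
    return rounds

# 100x 8 TB disks, 300 TB of data, 2x cross-chassis duplication on the old side
print(migration_rounds(100, 8, 300))  # -> 2 rounds with these numbers
print(migration_rounds(100, 8, 400))  # -> None: no disks free to start with
```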

I'm skipping over the rest of your email because I don't see anything there
that's of concern.

Thanks a ton for your reply and the time you put into it!  That was
incredibly helpful in planning a safe path forward.
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
