First off, to answer your questions about mons, you need to understand that
they work in a Paxos quorum.  That means a majority of the mons must agree
that they are in charge.  This is why an even number of mons is a bad idea:
they can split evenly in half, leaving neither side with a majority.  For
this case, let's say you have 3 mons.  2 of them need to be up and
communicating for them to agree that they can respond to clients.  If the
third mon is online, but networking troubles are keeping it from
communicating with the other 2 mons, it will realize that it isn't part of
the quorum and will refuse to respond to anyone that asks it questions.
I think there might be some logic for allowing 1 mon to manage the cluster,
but I think that works best if the other mons shut down properly and inform
the remaining mon that they are going offline, so it isn't up to a vote who
is in charge.
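
To make the majority arithmetic concrete, here is a small Python sketch.  It
is not Ceph code, just the strict-majority rule described above; the
function names are made up for illustration.

```python
def quorum_size(num_mons: int) -> int:
    """Smallest number of mons that forms a strict majority."""
    return num_mons // 2 + 1


def tolerated_failures(num_mons: int) -> int:
    """How many mons can be lost while a majority can still form."""
    return num_mons - quorum_size(num_mons)


# Print the quorum requirement for small mon counts.
for n in range(1, 7):
    print(f"{n} mons: quorum needs {quorum_size(n)}, "
          f"survives {tolerated_failures(n)} down")
```

Note that 4 mons survive only 1 failure, the same as 3, so the extra mon
adds nothing to availability.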

Lifecycle of a client and a mon.  When a client first communicates with a
Ceph cluster, it uses the mon_host setting in its ceph.conf file to know who
the mons are.  It goes through the list until it finds one that will
respond authoritatively for the cluster and give it the osd map.  Now that
it has an osd map, it can start communicating with all of the osds in the
cluster: reading, writing, mounting, etc.  This is usually where a client
stops talking to mons.  As a client talks with osds, the osds will respond
with updated maps if there are any.  This change was made in the Hammer
release of Ceph.  Before that, all map updates were handled by the mons,
which was a huge burden on them and kept clusters from growing larger than
about 1,000 osds because the mons couldn't handle managing the maps for any
more osds.  Starting in Hammer, and still today, osds update each other's
osd maps as they communicate with each other.  If anything is confused as
to which map to use, it still asks a mon and the mon will tell it the right
one.

If a mon goes down, the client uses the rest of the mon_host list to know
who to contact.  It might fail on a down mon, but it will retry and get to
one that is online.  Mons are the keepers of cephx auth keys and map
versions, but other than that, they really don't impact performance much.
Everything else is handled by the algorithms in the osd map that tell a
client where all objects and osds are in the cluster, and the majority of
map updates will come from communication with the osds.
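
The fallback behaviour described above can be pictured with a toy Python
sketch.  This is only an illustration of "walk the mon_host list until one
answers"; it is not how the Ceph client libraries are actually implemented,
and the addresses and the try_mon callback are hypothetical.

```python
from typing import Callable, Optional, Sequence


def first_responsive_mon(
    mon_hosts: Sequence[str],
    try_mon: Callable[[str], bool],
) -> Optional[str]:
    """Return the first mon for which try_mon succeeds, or None if none do."""
    for host in mon_hosts:
        if try_mon(host):
            return host
    return None


# Hypothetical addresses; in practice these come from mon_host in ceph.conf.
mon_hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
# Pretend 10.0.0.1 is down or cut off from the quorum.
in_quorum = {"10.0.0.2", "10.0.0.3"}

chosen = first_responsive_mon(mon_hosts, lambda host: host in in_quorum)
print(chosen)
```

Once a mon has handed over the osd map, the client talks to the osds
directly, so the choice of mon has little effect on I/O.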

Back to VMs and librbd vs krbd (which provides the /dev/rbd* devices).  The
kernel driver does not have feature parity with Ceph.  Even the latest
kernel does not support all Ceph RBD features, and you will have to disable
the unsupported ones in your cluster.  That means losing things like object
map, which is how Ceph keeps track of which objects do and don't exist in
an RBD.  Without object map, Ceph has to assume that every object that can
exist in an RBD does.  With object map, if you delete an RBD, Ceph issues a
delete only to the objects that exist; without it, Ceph has to attempt to
delete every object regardless of whether it exists.  Checking the used
space of an RBD with object map is instant; checking it without object map
can take several minutes on RBDs that are only 100GB in size (this is even
worse if you are using snapshots, as it has to check for every object that
can possibly exist on the RBD itself as well as on the snapshots).
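
For a rough sense of scale, here is a back-of-the-envelope Python sketch.
It assumes 4 MiB objects (the usual RBD default) and a made-up count of
objects actually written; the function names are hypothetical and it only
models the counting described above, not what Ceph does internally.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def objects_in_image(image_bytes: int, object_bytes: int = 4 * MIB) -> int:
    """Number of objects an image of this size can span (rounded up)."""
    return -(-image_bytes // object_bytes)


def delete_ops(image_bytes: int, existing_objects: int,
               has_object_map: bool) -> int:
    """Deletes attempted when removing the image, per the description above."""
    if has_object_map:
        return existing_objects
    return objects_in_image(image_bytes)


image = 100 * GIB
written = 2_000  # hypothetical: objects that actually hold data

print(objects_in_image(image))
print(delete_ops(image, written, has_object_map=True))
print(delete_ops(image, written, has_object_map=False))
```

A mostly empty 100 GiB image still means 25,600 possible objects to walk
when there is no object map, and snapshots multiply that.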

librbd has feature parity with Ceph because it is updated with, and carries
the same version as, every Ceph release.  krbd is still trying to implement
RBD features released over a year ago.  I prefer to use the Ceph libraries
as often as possible, then the fuse drivers (except rbd-fuse because it is
slower than dirt), and only if I have no other choice will I use the kernel
drivers.  When it comes to choosing a hypervisor for hosting VMs on RBDs,
there is no question in my mind that I would only look at options that use
librbd.

On Tue, Feb 13, 2018 at 6:13 PM Egoitz Aurrekoetxea wrote:

> Hi David!!
> Thanks a lot for your answer. But what happens when you have... imagine
> two monitors or more and one of them becomes unresponsive? Is another one
> used after a timeout, or... what happens when a client wants to access
> some data, needs to query a monitor for that (to know where the data is),
> and the monitor does not answer? Is a monitor that becomes unresponsive
> discarded for subsequent queries about where the data lives in the
> cluster?
> So, in other words... in terms of performance, you wouldn't use any kind
> of solution that doesn't go through librbd? Is the performance poor or
> bad when using mounted /dev/rbdX devices? Or perhaps you mean in terms
> of data integrity?
> I was planning to use Xen with Ceph, but after your advice... 😀. Would
> you definitely go with KVM?
> Thanks a lot again 😉
> Cheers,
> Egoitz
> On 13 Feb 2018, at 20:19, David Turner wrote:
> Monitors are not required for accessing data from the Ceph cluster.
> Clients will ask a monitor for a current OSD map and then use that OSD map
> to communicate with the OSDs directly for all reads and writes.  The map
> includes the crush map which has all of the information a client needs to
> know where every object is in the cluster.  Having 3 mons is a good number
> for small deployments.  5 mons gives better redundancy in the monitor
> quorum.  Always avoid an even number of mons.
> librbd is definitely the way to go for accessing RBDs for a hypervisor as
> opposed to fuse or krbd.  For a quick and easy hypervisor using Ceph, I
> like Proxmox.  It natively has the ability to use KVM with Ceph without
> having to configure it yourself.  It comes with a nice GUI as well to see
> the console screen for your VMs.  It also has a fairly simple guide to
> clustering hypervisors together to provide HA support for your VMs.  For
> larger scale VM deployments, OpenStack is probably the way I would go.
> On Tue, Feb 13, 2018 at 2:11 PM Egoitz Aurrekoetxea wrote:
>> Good afternoon,
>> As I'm new to Ceph, I was wondering what the most proper way would be to
>> use it with the Xen hypervisor (with a plain Linux installation, CentOS,
>> for instance). I have read that the least proper one is to just mount
>> the /dev/rbdX device on a mount point and expose that space to the
>> hypervisor, but I find it pretty easy and it seems stable. It doesn't
>> seem to perform badly... Is it better to use, for instance, librbd
>> with KVM? Does it perform better?
>> By the way, it seems to use the monitor node in order to access the
>> space in the osd cluster. I have also read that Ceph has been designed
>> with no single points of failure in mind, but... is it possible to
>> configure several monitor nodes, and then, after a very short timeout
>> or similar, access the file system through the other nodes? What
>> would be the most proper way of configuring this to avoid a machine
>> losing its storage if the monitor fails? Could you please point me in
>> the right direction? Perhaps with several monitors or...
>> By the way, if you think it would be better to use another hypervisor
>> or config (with librados or whatever) with Ceph, could you please
>> suggest that too? Help the newbie :p :) :)
>> Best regards,
>> _______________________________________________
>> ceph-users mailing list