You wouldn't be able to guarantee that the cluster will not use 2 servers from the same physical rack. The problem with 3 failure domains, however, is that if you lose a full failure domain, Ceph can do nothing to maintain 3 copies of your data. That leaves you in a position where you need to rush to the datacenter and fix the hardware problem ASAP.
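To make the virtual-rack idea a bit more concrete (Laszlo asks for an example in the quoted mail below), here is a rough sketch of what the relevant part of a decompiled CRUSH map could look like with 12 hosts split across 6 logical racks of 2 hosts each. The bucket names, IDs, and weights are made up for illustration, the host buckets would be defined elsewhere in the map, and the exact rule syntax differs slightly between Ceph releases; the point is the chooseleaf step picking 3 different rack buckets. Note that nothing here ties a logical rack to a physical one, which is exactly why the guarantee above can't be made.

  # two hosts per logical rack; host buckets are defined elsewhere in the map
  rack rack1 {
          id -11
          alg straw
          hash 0        # rjenkins1
          item host01 weight 10.000
          item host02 weight 10.000
  }
  rack rack2 {
          id -12
          alg straw
          hash 0
          item host03 weight 10.000
          item host04 weight 10.000
  }
  # ... rack3 through rack6 follow the same pattern ...

  root default {
          id -1
          alg straw
          hash 0
          item rack1 weight 20.000
          item rack2 weight 20.000
          item rack3 weight 20.000
          item rack4 weight 20.000
          item rack5 weight 20.000
          item rack6 weight 20.000
  }

  # replicated rule that places each copy under a different (logical) rack
  rule replicated_racks {
          ruleset 1     # newer releases use "id" here instead of "ruleset"
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type rack
          step emit
  }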
On Fri, Jun 2, 2017, 5:14 AM Laszlo Budai <las...@componentsoft.eu> wrote:
> Hi David,
>
> If I understand correctly, your suggestion is the following: if we have, for instance, 12 servers grouped into 3 racks (4/rack), then you would build a crush map saying that you have 6 racks (virtual ones) with 2 servers in each of them, right?
>
> In this case, if we set the failure domain to rack and the size of a pool to 3, how do you make sure that the crush map will not use 2 servers from the same physical rack for a PG? Could you provide an example of distribution of servers to virtual racks?
>
> Thank you,
> Laszlo
>
> On 01.06.2017 22:23, David Turner wrote:
> > The way to do this is to download your crush map and modify it manually after decompiling it to text format, or modify it using the crushtool. Once you have your crush map with the rules in place that you want, you upload the crush map back to the cluster. When you change your failure domain from host to rack, or make any other change to the failure domain, it will cause all of your PGs to peer at the same time, so you want to make sure that you have enough memory to handle this scenario. After that point, your cluster will just backfill the PGs from where they currently are to their new location and then clean up after itself. It is recommended to monitor your cluster usage and modify osd_max_backfills during this process to optimize how fast you can finish your backfilling while keeping your cluster usable by the clients.
> >
> > I generally recommend starting a cluster with at least n+2 failure domains, so I would recommend against going to a rack failure domain with only 3 racks. As an alternative that I've done, I've set up 6 "racks" when I only had 3 racks, with planned growth to a full 6 racks. When I added servers and expanded to fill more racks, I moved the servers to where they are represented in the crush map. So if a server is physically in rack 1 but set as rack 4 in the crush map, I would move those servers to the physical rack 4 and start filling out rack 1 and rack 4 to complete their capacity, then do the same for racks 2/5 when I start into the 5th rack.
> >
> > Another option to having full racks in your crush map is having half racks. I've also done this for clusters that wouldn't grow larger than 3 racks: have 6 failure domains at half racks. It lowers your chance of having random drives fail in different failure domains at the same time, and gives you more servers that you can run maintenance on at a time than a host failure domain does. It doesn't resolve the issue of using a single cross-link for the entire rack, or a full power failure of the rack, but it's closer.
> >
> > The problem with having 3 failure domains with replica 3 is that if you lose a complete failure domain, then you have nowhere for the 3rd replica to go. If you have 4 failure domains with replica 3 and you lose an entire failure domain, then you overfill the remaining 3 failure domains and can only really use about 55% of your cluster capacity. If you have 5 failure domains, then you start normalizing and losing a failure domain doesn't impact you as severely. The more failure domains you have, the less it affects you when you lose one.
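As a rough sketch of the download/decompile/edit/recompile/upload workflow described above (the file names here are just placeholders):

  # grab the current CRUSH map and decompile it to editable text
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # edit crushmap.txt (e.g. change "step chooseleaf firstn 0 type host"
  # to "... type rack" in the relevant rule), then recompile and inject it
  crushtool -c crushmap.txt -o crushmap.new.bin
  ceph osd setcrushmap -i crushmap.new.bin

Injecting the new map is the moment all of the affected PGs re-peer and the backfill starts, so do it when you are ready for the data movement.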
> > Let's do another scenario with 3 failure domains and replica size 3. Every OSD you lose inside of a failure domain gets backfilled directly onto the remaining OSDs in that failure domain. There comes a point where a switch failure in a rack, or losing a node in the rack, could overfill the remaining OSDs in that rack. If you have enough servers and OSDs in the rack, then this becomes moot... but if you have a smaller cluster with only 3 nodes and 4 drives in each, then when you lose a drive in one of your nodes, all of its data gets distributed to the other 3 drives in that node. That means you either have to replace your storage ASAP when it fails, or never fill your cluster up more than 55% if you want to be able to automatically recover from a drive failure.
> >
> > tl;dr: Make sure you calculate what your failure domain, replica size, drive size, etc. mean for how fast you have to replace storage when it fails, and how full you can fill your cluster and still afford a hardware loss.
> >
> > On Thu, Jun 1, 2017 at 12:40 PM Deepak Naidu <dna...@nvidia.com> wrote:
> > >
> > > Greetings Folks.
> > >
> > > Wanted to understand how ceph works when we start with rack aware (rack-level replica), for example 3 racks and 3 replicas in the crushmap, and this is in future replaced by node aware (node-level replica), i.e. 3 replicas spread across nodes.
> > >
> > > This can be vice-versa. If this happens, how does ceph rearrange the "old" data? Do I need to trigger any command to ensure the data placement is based on the latest crushmap, or does ceph take care of it automatically?
> > >
> > > Thanks for your time.
> > >
> > > --
> > > Deepak
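On Deepak's question: once the new CRUSH map is in place, Ceph remaps and backfills the affected PGs on its own; there is no separate command needed to trigger the data movement. A rough sketch of how you might watch and throttle it while it runs (the backfill value below is only an example, not a recommendation):

  # watch recovery/backfill progress and per-OSD fullness
  ceph -s
  ceph osd df tree

  # raise or lower the per-OSD backfill limit on the fly
  ceph tell osd.* injectargs '--osd-max-backfills 1'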
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com