You wouldn't be able to guarantee that the cluster will not use 2 servers from the same physical rack. The problem with 3 failure domains, however, is that if you lose a full failure domain, Ceph can do nothing to maintain 3 copies of your data. That leaves you in a position where you need to rush to the datacenter and fix the hardware problem ASAP.
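To make the virtual-rack idea a bit more concrete (Laszlo asks for an example in the quoted mail below), here is a rough sketch of what the relevant part of a decompiled CRUSH map could look like with 12 hosts split across 6 logical racks of 2 hosts each. The bucket names, IDs, and weights are made up for illustration, the host buckets would be defined elsewhere in the map, and the exact rule syntax differs slightly between Ceph releases; the point is the chooseleaf step picking 3 different rack buckets. Note that nothing here ties a logical rack to a physical one, which is exactly why the guarantee above can't be made.

  # two hosts per logical rack; host buckets are defined elsewhere in the map
  rack rack1 {
          id -11
          alg straw
          hash 0        # rjenkins1
          item host01 weight 10.000
          item host02 weight 10.000
  }
  rack rack2 {
          id -12
          alg straw
          hash 0
          item host03 weight 10.000
          item host04 weight 10.000
  }
  # ... rack3 through rack6 follow the same pattern ...

  root default {
          id -1
          alg straw
          hash 0
          item rack1 weight 20.000
          item rack2 weight 20.000
          item rack3 weight 20.000
          item rack4 weight 20.000
          item rack5 weight 20.000
          item rack6 weight 20.000
  }

  # replicated rule that places each copy under a different (logical) rack
  rule replicated_racks {
          ruleset 1     # newer releases use "id" here instead of "ruleset"
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type rack
          step emit
  }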
On Fri, Jun 2, 2017, 5:14 AM Laszlo Budai <las...@componentsoft.eu> wrote:
> Hi David,
>
> If I understand correctly, your suggestion is the following: if we have, for instance, 12 servers grouped into 3 racks (4/rack), then you would build a crush map saying that you have 6 racks (virtual ones) with 2 servers in each of them, right?
>
> In this case, if we set the failure domain to rack and the size of a pool to 3, how do you make sure that the crush map will not use 2 servers from the same physical rack for a PG? Could you provide an example of distribution of servers to virtual racks?
>
> Thank you,
> Laszlo
>
> On 01.06.2017 22:23, David Turner wrote:
> > The way to do this is to download your crush map and modify it manually after decompiling it to text format, or modify it using the crushtool. Once you have your crush map with the rules in place that you want, you upload the crush map back to the cluster. When you change your failure domain from host to rack, or make any other change to the failure domain, it will cause all of your PGs to peer at the same time, so you want to make sure that you have enough memory to handle this scenario. After that point, your cluster will just backfill the PGs from where they currently are to their new location and then clean up after itself. It is recommended to monitor your cluster usage and modify osd_max_backfills during this process to optimize how fast you can finish your backfilling while keeping your cluster usable by the clients.
> >
> > I generally recommend starting a cluster with at least n+2 failure domains, so I would recommend against going to a rack failure domain with only 3 racks. As an alternative that I've done, I've set up 6 "racks" when I only had 3 racks, with planned growth to a full 6 racks. When I added servers and expanded to fill more racks, I moved the servers to where they are represented in the crush map. So if a server is physically in rack 1 but set as rack 4 in the crush map, I would move those servers to the physical rack 4 and start filling out rack 1 and rack 4 to complete their capacity, then do the same for racks 2/5 when I start into the 5th rack.
> >
> > Another option to having full racks in your crush map is having half racks. I've also done this for clusters that wouldn't grow larger than 3 racks: have 6 failure domains at half racks. It lowers your chance of having random drives fail in different failure domains at the same time, and gives you more servers that you can run maintenance on at a time than a host failure domain does. It doesn't resolve the issue of using a single cross-link for the entire rack, or a full power failure of the rack, but it's closer.
> >
> > The problem with having 3 failure domains with replica 3 is that if you lose a complete failure domain, then you have nowhere for the 3rd replica to go. If you have 4 failure domains with replica 3 and you lose an entire failure domain, then you overfill the remaining 3 failure domains and can only really use about 55% of your cluster capacity. If you have 5 failure domains, then you start normalizing and losing a failure domain doesn't impact you as severely. The more failure domains you have, the less it affects you when you lose one.
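As a rough sketch of the download/decompile/edit/recompile/upload workflow described above (the file names here are just placeholders):

  # grab the current CRUSH map and decompile it to editable text
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # edit crushmap.txt (e.g. change "step chooseleaf firstn 0 type host"
  # to "... type rack" in the relevant rule), then recompile and inject it
  crushtool -c crushmap.txt -o crushmap.new.bin
  ceph osd setcrushmap -i crushmap.new.bin

Injecting the new map is the moment all of the affected PGs re-peer and the backfill starts, so do it when you are ready for the data movement.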
> > Let's do another scenario with 3 failure domains and replica size 3. Every OSD you lose inside of a failure domain gets backfilled directly onto the remaining OSDs in that failure domain. There comes a point where a switch failure in a rack, or losing a node in the rack, could overfill the remaining OSDs in that rack. If you have enough servers and OSDs in the rack, then this becomes moot... but if you have a smaller cluster with only 3 nodes and 4 drives in each, then when you lose a drive in one of your nodes, all of its data gets distributed to the other 3 drives in that node. That means you either have to replace your storage ASAP when it fails, or never fill your cluster up more than 55% if you want to be able to automatically recover from a drive failure.
> >
> > tl;dr: Make sure you calculate what your failure domain, replica size, drive size, etc. mean for how fast you have to replace storage when it fails, and how full you can fill your cluster and still afford a hardware loss.
> >
> > On Thu, Jun 1, 2017 at 12:40 PM Deepak Naidu <dna...@nvidia.com> wrote:
> > >
> > > Greetings Folks.
> > >
> > > Wanted to understand how ceph works when we start with rack aware (rack-level replica), for example 3 racks and 3 replicas in the crushmap, and this is in future replaced by node aware (node-level replica), i.e. 3 replicas spread across nodes.
> > >
> > > This can be vice-versa. If this happens, how does ceph rearrange the "old" data? Do I need to trigger any command to ensure the data placement is based on the latest crushmap, or does ceph take care of it automatically?
> > >
> > > Thanks for your time.
> > >
> > > --
> > > Deepak
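On Deepak's question: once the new CRUSH map is in place, Ceph remaps and backfills the affected PGs on its own; there is no separate command needed to trigger the data movement. A rough sketch of how you might watch and throttle it while it runs (the backfill value below is only an example, not a recommendation):

  # watch recovery/backfill progress and per-OSD fullness
  ceph -s
  ceph osd df tree

  # raise or lower the per-OSD backfill limit on the fly
  ceph tell osd.* injectargs '--osd-max-backfills 1'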
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com