Oh, you mean that monitor quorum is enforced? I hadn't really considered that.
However, I think I've found another solution:
I created a second tree called "ldc" and under it made three "logical
datacenters" (pending a better name), grouping the servers so that each
logical datacenter contains three servers: one SSD and two HDD, selected
from different physical datacenters. I could then rewrite my hybrid rule
to simply select one logical datacenter and then three hostgroups from
it. I also made a new bucket type called "hostgroup" that the physical
servers go under, so it is easy to add more servers in the future (just
add them to the correct hostgroup).
It should work; I will test it fully this coming week.
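One way to dry-run the map before touching the cluster is crushtool, which can compile the map and simulate placements for a rule offline. A minimal sketch; the file names are hypothetical and should be adjusted to your setup:

```shell
# Offline test of the edited map with crushtool (ships with Ceph).
# "crushmap.txt" is a hypothetical name for the decompiled map below.
if command -v crushtool >/dev/null 2>&1; then
    crushtool -c crushmap.txt -o crushmap.bin        # compile text -> binary
    # simulate the hybrid rule (id 1) for 3 replicas, print input -> OSD mappings
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings
    # any input that cannot get 3 distinct OSDs is reported here
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings
else
    echo "crushtool not found; install the ceph tools to run this check"
fi
checked=yes
```

In particular, `--show-bad-mappings` should make it visible if the currently empty hostgroups (hg1-2 in ldc1 and hg3-3 in ldc3) leave the rule able to find only two leaves for a size-3 pool when it lands in those logical datacenters.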
The complete crushmap is below. The buckets and rules for the two other,
more conventional pools are unchanged; the interesting part starts about
halfway down.
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd
device 34 osd.34 class hdd
device 35 osd.35 class hdd
# types
type 0 osd
type 1 host
type 2 hostgroup
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host storage11 {
id -5 # do not change unnecessarily
id -6 class nvme # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 2.913
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.729
item osd.3 weight 0.728
item osd.6 weight 0.728
item osd.9 weight 0.728
}
host storage21 {
id -13 # do not change unnecessarily
id -14 class nvme # do not change unnecessarily
id -15 class hdd # do not change unnecessarily
# weight 65.496
alg straw2
hash 0 # rjenkins1
item osd.12 weight 5.458
item osd.13 weight 5.458
item osd.14 weight 5.458
item osd.15 weight 5.458
item osd.16 weight 5.458
item osd.17 weight 5.458
item osd.18 weight 5.458
item osd.19 weight 5.458
item osd.20 weight 5.458
item osd.21 weight 5.458
item osd.22 weight 5.458
item osd.23 weight 5.458
}
datacenter HORN79 {
id -19 # do not change unnecessarily
id -26 class nvme # do not change unnecessarily
id -27 class hdd # do not change unnecessarily
# weight 68.406
alg straw2
hash 0 # rjenkins1
item storage11 weight 2.911
item storage21 weight 65.495
}
host storage13 {
id -7 # do not change unnecessarily
id -8 class nvme # do not change unnecessarily
id -11 class hdd # do not change unnecessarily
# weight 2.912
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.728
item osd.5 weight 0.728
item osd.8 weight 0.728
item osd.11 weight 0.728
}
host storage23 {
id -16 # do not change unnecessarily
id -17 class nvme # do not change unnecessarily
id -18 class hdd # do not change unnecessarily
# weight 65.496
alg straw2
hash 0 # rjenkins1
item osd.24 weight 5.458
item osd.25 weight 5.458
item osd.26 weight 5.458
item osd.27 weight 5.458
item osd.28 weight 5.458
item osd.29 weight 5.458
item osd.30 weight 5.458
item osd.31 weight 5.458
item osd.32 weight 5.458
item osd.33 weight 5.458
item osd.34 weight 5.458
item osd.35 weight 5.458
}
datacenter WAR {
id -20 # do not change unnecessarily
id -24 class nvme # do not change unnecessarily
id -25 class hdd # do not change unnecessarily
# weight 68.406
alg straw2
hash 0 # rjenkins1
item storage13 weight 2.911
item storage23 weight 65.495
}
host storage12 {
id -3 # do not change unnecessarily
id -4 class nvme # do not change unnecessarily
id -9 class hdd # do not change unnecessarily
# weight 2.912
alg straw2
hash 0 # rjenkins1
item osd.1 weight 0.728
item osd.4 weight 0.728
item osd.7 weight 0.728
item osd.10 weight 0.728
}
datacenter TEG4 {
id -21 # do not change unnecessarily
id -22 class nvme # do not change unnecessarily
id -23 class hdd # do not change unnecessarily
# weight 2.911
alg straw2
hash 0 # rjenkins1
item storage12 weight 2.911
}
root default {
id -1 # do not change unnecessarily
id -2 class nvme # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
# weight 139.721
alg straw2
hash 0 # rjenkins1
item HORN79 weight 68.405
item WAR weight 68.405
item TEG4 weight 2.911
}
hostgroup hg1-1 {
id -30 # do not change unnecessarily
id -28 class nvme # do not change unnecessarily
id -54 class hdd # do not change unnecessarily
# weight 2.913
alg straw2
hash 0 # rjenkins1
item storage11 weight 2.913
}
hostgroup hg1-2 {
id -31 # do not change unnecessarily
id -29 class nvme # do not change unnecessarily
id -55 class hdd # do not change unnecessarily
# weight 0.000
alg straw2
hash 0 # rjenkins1
}
hostgroup hg1-3 {
id -32 # do not change unnecessarily
id -43 class nvme # do not change unnecessarily
id -56 class hdd # do not change unnecessarily
# weight 65.496
alg straw2
hash 0 # rjenkins1
item storage23 weight 65.496
}
hostgroup hg2-1 {
id -33 # do not change unnecessarily
id -45 class nvme # do not change unnecessarily
id -58 class hdd # do not change unnecessarily
# weight 2.912
alg straw2
hash 0 # rjenkins1
item storage12 weight 2.912
}
hostgroup hg2-2 {
id -34 # do not change unnecessarily
id -46 class nvme # do not change unnecessarily
id -59 class hdd # do not change unnecessarily
# weight 65.496
alg straw2
hash 0 # rjenkins1
item storage21 weight 65.496
}
hostgroup hg2-3 {
id -35 # do not change unnecessarily
id -47 class nvme # do not change unnecessarily
id -60 class hdd # do not change unnecessarily
# weight 65.496
alg straw2
hash 0 # rjenkins1
item storage23 weight 65.496
}
hostgroup hg3-1 {
id -36 # do not change unnecessarily
id -49 class nvme # do not change unnecessarily
id -62 class hdd # do not change unnecessarily
# weight 2.912
alg straw2
hash 0 # rjenkins1
item storage13 weight 2.912
}
hostgroup hg3-2 {
id -37 # do not change unnecessarily
id -50 class nvme # do not change unnecessarily
id -63 class hdd # do not change unnecessarily
# weight 65.496
alg straw2
hash 0 # rjenkins1
item storage21 weight 65.496
}
hostgroup hg3-3 {
id -38 # do not change unnecessarily
id -51 class nvme # do not change unnecessarily
id -64 class hdd # do not change unnecessarily
# weight 0.000
alg straw2
hash 0 # rjenkins1
}
datacenter ldc1 {
id -39 # do not change unnecessarily
id -44 class nvme # do not change unnecessarily
id -57 class hdd # do not change unnecessarily
# weight 68.409
alg straw2
hash 0 # rjenkins1
item hg1-1 weight 2.913
item hg1-2 weight 0.000
item hg1-3 weight 65.496
}
datacenter ldc2 {
id -40 # do not change unnecessarily
id -48 class nvme # do not change unnecessarily
id -61 class hdd # do not change unnecessarily
# weight 133.904
alg straw2
hash 0 # rjenkins1
item hg2-1 weight 2.912
item hg2-2 weight 65.496
item hg2-3 weight 65.496
}
datacenter ldc3 {
id -41 # do not change unnecessarily
id -52 class nvme # do not change unnecessarily
id -65 class hdd # do not change unnecessarily
# weight 68.408
alg straw2
hash 0 # rjenkins1
item hg3-1 weight 2.912
item hg3-2 weight 65.496
item hg3-3 weight 0.000
}
root ldc {
id -42 # do not change unnecessarily
id -53 class nvme # do not change unnecessarily
id -66 class hdd # do not change unnecessarily
# weight 270.721
alg straw2
hash 0 # rjenkins1
item ldc1 weight 68.409
item ldc2 weight 133.904
item ldc3 weight 68.408
}
# rules
rule hybrid {
id 1
type replicated
min_size 1
max_size 10
step take ldc
step choose firstn 1 type datacenter
step chooseleaf firstn 0 type hostgroup
step emit
}
rule hdd {
id 2
type replicated
min_size 1
max_size 3
step take default class hdd
step chooseleaf firstn 0 type datacenter
step emit
}
rule nvme {
id 3
type replicated
min_size 1
max_size 3
step take default class nvme
step chooseleaf firstn 0 type datacenter
step emit
}
# end crush map
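To illustrate the selection shape the hybrid rule encodes, here is a toy Python sketch (not real CRUSH: no straw2 hashing, no weights, just a seeded RNG standing in for the deterministic hash) of "take ldc; choose firstn 1 type datacenter; chooseleaf firstn 0 type hostgroup", with bucket names taken from the map above:

```python
import random

# Logical-datacenter tree from the map above: ldc -> hostgroups -> hosts.
# hg1-2 and hg3-3 are empty, exactly as in the crushmap (weight 0.000).
ldc = {
    "ldc1": {"hg1-1": ["storage11"], "hg1-2": [], "hg1-3": ["storage23"]},
    "ldc2": {"hg2-1": ["storage12"], "hg2-2": ["storage21"], "hg2-3": ["storage23"]},
    "ldc3": {"hg3-1": ["storage13"], "hg3-2": ["storage21"], "hg3-3": []},
}

def place(pg_seed, num_rep=3):
    """Toy imitation of the hybrid rule's selection order."""
    rng = random.Random(pg_seed)      # stand-in for CRUSH's deterministic hash
    dc = rng.choice(sorted(ldc))      # step choose firstn 1 type datacenter
    hosts = []
    for hg in sorted(ldc[dc]):        # step chooseleaf firstn 0 type hostgroup
        if ldc[dc][hg]:               # an empty hostgroup yields no leaf
            hosts.append(rng.choice(ldc[dc][hg]))
        if len(hosts) == num_rep:
            break
    return dc, hosts
```

Because every hostgroup inside one logical datacenter holds servers from different physical datacenters, replicas chosen this way land on distinct hosts and distinct physical sites by construction. The sketch also makes the gap visible: as long as ldc1 and ldc3 each keep an empty hostgroup, a placement there can only return two hosts.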
On 10/8/2017 3:22 PM, David Turner wrote:
>
> That's correct. It doesn't matter how many copies of the data you have
> in each datacenter. The mons control the maps and you should be good
> as long as you have 1 mon per DC. You should test this to see how the
> recovery goes, but there shouldn't be a problem.
>
>
> On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир <[email protected]> wrote:
>
> 2017-10-08 2:02 GMT+05:00 Peter Linder <[email protected]>:
>
>>
>> Then, I believe, the next best configuration would be to set
>> size for this pool to 4. It would choose an NVMe as the
>> primary OSD, and then choose an HDD from each DC for the
>> secondary copies. This will guarantee that a copy of the
>> data goes into each DC and you will have 2 copies in other
>> DCs away from the primary NVMe copy. It wastes a copy of all
>> of the data in the pool, but that's on the much cheaper HDD
>> storage and can probably be considered acceptable losses for
>> the sake of having the primary OSD on NVMe drives.
> I have considered this, and it should of course work while
> everything is up, so to speak, but what if one datacenter is
> isolated while running? We would be left with 2 running copies
> on each side for all PGs, with no way of knowing what gets
> written where. In the end, data would be destroyed due to the
> split brain. Even being able to enforce quorum where the SSD
> is would mean a single point of failure.
>
> If you have one mon per DC, all operations in the isolated DC
> will be frozen, so I believe you would not lose data.
>
>>
>> On Sat, Oct 7, 2017 at 3:36 PM Peter Linder
>> <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> On 10/7/2017 8:08 PM, David Turner wrote:
>>>
>>> Just to make sure you understand that the reads will
>>> happen on the primary osd for the PG and not the nearest
>>> osd, meaning that reads will go between the datacenters.
>>> Also that each write will not ack until all 3 writes
>>> happen adding the latency to the writes and reads both.
>>>
>>>
>>
>> Yes, I understand this. It is actually fine; the
>> datacenters have been selected so that they are about
>> 10-20 km apart. This yields around 0.1-0.2 ms round-trip
>> time due to the speed of light being too low.
>> Nevertheless, network latency shouldn't be a problem,
>> and it's all a dedicated 40G TRILL network for the
>> moment.
>>
>> I just want to be able to select 1 SSD and 2 HDDs, all
>> spread out. I can do that, but one of the HDDs ends up in
>> the same datacenter, probably because I'm using the
>> "take" command twice (does it reset the selected buckets?).
>>
>>
>>
>>> On Sat, Oct 7, 2017, 1:48 PM Peter Linder
>>> <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>> Hello!
>>>>
>>>> 2017-10-07 19:12 GMT+05:00 Peter Linder
>>>> <[email protected]
>>>> <mailto:[email protected]>>:
>>>>
>>>> The idea is to select an nvme osd, and
>>>> then select the rest from hdd osds in different
>>>> datacenters (see crush
>>>> map below for hierarchy).
>>>>
>>>> It's a little aside from the question, but why do
>>>> you want to mix SSDs and HDDs in the same pool? Do
>>>> you have a read-intensive workload and plan to use
>>>> primary-affinity to get all reads from nvme?
>>>>
>>>>
>>> Yes, this is pretty much the idea, getting the
>>> performance from NVMe reads, while still maintaining
>>> triple redundancy and a reasonable cost.
>>>
>>>
>>>> --
>>>> Regards,
>>>> Vladimir
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
>
> --
> Regards,
> Vladimir