On 08/14/2014 02:35 AM, Christian Balzer wrote:
The default (firefly, but previous releases are functionally identical) CRUSH
map has:
---
# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
---
The 'type host' step means there will be no more than one replica per host
(node), so with size=3 you will need at least 3 hosts to choose from.
If you were to change this to 'type osd', all 3 replicas could wind up on
the same host, which is not really a good idea.
Ah, this is a great clue. (On my cluster, the default rule contains
'step choose firstn 0 type osd', and thus has the problem you hint at here.)
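For that case, assuming the cluster actually has at least 'size' hosts, the
usual fix is to switch that step back to host granularity, as in the default
rule quoted above:

```
step chooseleaf firstn 0 type host
```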
So I played with a new rule set with the buckets 'root', 'rack', 'host',
'bank' and 'osd', of which 'rack' and 'host' are unused. About the 'bank'
bucket: the OSD nodes each contain two 'banks' of disks, each with a
separate disk controller channel, a separate power supply cable, and a
separate SSD. Thus, 'bank' actually does represent a real failure domain.
More importantly, this provides a bucket level below 'host' (and above
'osd') that is big enough for 3-4 replicas. Here's the rule:
rule by_bank {
	ruleset 3
	type replicated
	min_size 3
	max_size 4
	step take default
	step choose firstn 0 type bank
	step choose firstn 0 type osd
	step emit
}
If the OP (sorry, Craig, you do have a name ;) wants to play with CRUSH
map rules, here's the quick and dirty of what I did:
# get the current 'orig' CRUSH map and decompile it; see:
# http://ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
ceph osd getcrushmap -o /tmp/crush-orig.bin
crushtool -d /tmp/crush-orig.bin -o /tmp/crush.txt
$EDITOR /tmp/crush.txt
# edit the crush map with your fave editor; see:
# http://ceph.com/docs/master/rados/operations/crush-map
#
# in my case, I added the bank type:
type 0 osd
type 1 bank
type 2 host
type 3 rack
type 4 root
# the banks (repeat as applicable):
bank bank0 {
	id -6
	alg straw
	hash 0
	item osd.0 weight 1.000
	item osd.1 weight 1.000
}
bank bank1 {
	id -7
	alg straw
	hash 0
	item osd.2 weight 1.000
	item osd.3 weight 1.000
}
# updated the hosts (repeat as applicable):
host host0 {
	id -4		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item bank0 weight 2.000
	item bank1 weight 2.000
}
# and added the rule:
rule by_bank {
	ruleset 3
	type replicated
	min_size 3
	max_size 4
	step take default
	step choose firstn 0 type bank
	step choose firstn 0 type osd
	step emit
}
# compile the crush map:
crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin
# and run some tests; the replica sizes tested come from
# 'min_size' and 'max_size' in the above rule; see:
# http://ceph.com/docs/master/man/8/crushtool/#running-tests-with-test
#
# show sample PG->OSD maps:
crushtool -i /tmp/crush-new.bin --test --show-statistics
# show bad mappings; if the CRUSH map is correct,
# this should be empty:
crushtool -i /tmp/crush-new.bin --test --show-bad-mappings
# show per-OSD pg utilization:
crushtool -i /tmp/crush-new.bin --test --show-utilization
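One step the walkthrough above leaves out: once the tests look sane, the new
map still has to be injected into the cluster, and a pool pointed at the new
rule. Roughly (untested here; the pool name 'rbd' is just a placeholder):

```
# inject the edited CRUSH map into the running cluster:
ceph osd setcrushmap -i /tmp/crush-new.bin
# point a pool at the new rule (ruleset 3 in the example above):
ceph osd pool set rbd crush_ruleset 3
```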
You might finagle something like that (again, the rule splits on hosts) by
having multiple "hosts" on one physical machine, but therein lies madness.
Well, the bucket names can be changed, as above, and Sage hints at doing
something like this here:
http://wiki.ceph.com/Planning/Blueprints/Dumpling/extend_crush_rule_language
(And IIUC he also proposes something to implement my original
intentions: distribute four replicas, two on each of two racks, and
don't put two replicas on the same host within a rack; this is more
easily generalized than the above funky configuration.)
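For what it's worth, that two-per-rack layout can already be sketched in the
standard rule language; a rough, untested example, assuming a map with
populated 'rack' and 'host' buckets (the blueprint is about expressing such
constraints more flexibly, e.g. in degraded cases):

```
rule two_per_rack {
	ruleset 4
	type replicated
	min_size 4
	max_size 4
	step take default
	# pick 2 racks, then 2 distinct hosts in each rack,
	# one OSD per host: 4 replicas, 2 per rack
	step choose firstn 2 type rack
	step chooseleaf firstn 2 type host
	step emit
}
```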
John
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com