On 08/14/2014 02:35 AM, Christian Balzer wrote:
The default (firefly, but previous releases are functionally identical) CRUSH
map has:
---
# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
---
The 'type host' step means there will be no more than one replica per host
(node), so with size=3 you will need at least 3 hosts to choose from.
If you were to change this to 'type osd', all 3 replicas could wind up on
the same host, which is not really a good idea.
Ah, this is a great clue. (On my cluster, the default rule contains
'step choose firstn 0 type osd', and thus has the problem you hint at here.)
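For that case, assuming the cluster actually has at least 'size' hosts, the
usual fix is to switch that step back to host granularity, as in the default
rule quoted above:

```
step chooseleaf firstn 0 type host
```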
So I played with a new rule set with the buckets 'root', 'rack', 'host',
'bank' and 'osd', of which 'rack' and 'host' are unused. About the 'bank'
bucket: the OSD nodes each contain two 'banks' of disks, each with a
separate disk controller channel, a separate power supply cable, and a
separate SSD. Thus, 'bank' actually does represent a real failure domain.
More importantly, this provides a bucket level below 'host' (and above
'osd') that is big enough for 3-4 replicas. Here's the rule:
rule by_bank {
	ruleset 3
	type replicated
	min_size 3
	max_size 4
	step take default
	step choose firstn 0 type bank
	step choose firstn 0 type osd
	step emit
}
If the OP (sorry, Craig, you do have a name ;) wants to play with CRUSH
map rules, here's the quick and dirty of what I did:
# get the current 'orig' CRUSH map and decompile it; see:
# http://ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
ceph osd getcrushmap -o /tmp/crush-orig.bin
crushtool -d /tmp/crush-orig.bin -o /tmp/crush.txt
$EDITOR /tmp/crush.txt
# edit the crush map with your fave editor; see:
# http://ceph.com/docs/master/rados/operations/crush-map
#
# in my case, I added the bank type:
type 0 osd
type 1 bank
type 2 host
type 3 rack
type 4 root
# the banks (repeat as applicable):
bank bank0 {
	id -6
	alg straw
	hash 0
	item osd.0 weight 1.000
	item osd.1 weight 1.000
}
bank bank1 {
	id -7
	alg straw
	hash 0
	item osd.2 weight 1.000
	item osd.3 weight 1.000
}
# updated the hosts (repeat as applicable):
host host0 {
	id -4		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item bank0 weight 2.000
	item bank1 weight 2.000
}
# and added the rule:
rule by_bank {
	ruleset 3
	type replicated
	min_size 3
	max_size 4
	step take default
	step choose firstn 0 type bank
	step choose firstn 0 type osd
	step emit
}
# compile the crush map:
crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin
# and run some tests; the replica sizes tested come from
# 'min_size' and 'max_size' in the above rule; see:
# http://ceph.com/docs/master/man/8/crushtool/#running-tests-with-test
#
# show sample PG->OSD maps:
crushtool -i /tmp/crush-new.bin --test --show-statistics
# show bad mappings; if the CRUSH map is correct,
# this should be empty:
crushtool -i /tmp/crush-new.bin --test --show-bad-mappings
# show per-OSD pg utilization:
crushtool -i /tmp/crush-new.bin --test --show-utilization
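One step the walkthrough above leaves out: once the tests look sane, the new
map still has to be injected into the cluster, and a pool pointed at the new
rule. Roughly (untested here; the pool name 'rbd' is just a placeholder):

```
# inject the edited CRUSH map into the running cluster:
ceph osd setcrushmap -i /tmp/crush-new.bin
# point a pool at the new rule (ruleset 3 in the example above):
ceph osd pool set rbd crush_ruleset 3
```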
You might finagle something like that (again, the rule splits on hosts) by
having multiple "hosts" on one physical machine, but therein lies madness.
Well, the bucket names can be changed, as above, and Sage hints at doing
something like this here:
http://wiki.ceph.com/Planning/Blueprints/Dumpling/extend_crush_rule_language
(And IIUC he also proposes something to implement my original
intentions: distribute four replicas, two on each of two racks, and
don't put two replicas on the same host within a rack; this is more
easily generalized than the above funky configuration.)
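For what it's worth, that two-per-rack layout can already be sketched in the
standard rule language; a rough, untested example, assuming a map with
populated 'rack' and 'host' buckets (the blueprint is about expressing such
constraints more flexibly, e.g. in degraded cases):

```
rule two_per_rack {
	ruleset 4
	type replicated
	min_size 4
	max_size 4
	step take default
	# pick 2 racks, then 2 distinct hosts in each rack,
	# one OSD per host: 4 replicas, 2 per rack
	step choose firstn 2 type rack
	step chooseleaf firstn 2 type host
	step emit
}
```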
John
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com