On 08/18/2014 02:20 PM, John Morris wrote:


On 08/18/2014 01:49 PM, Sage Weil wrote:
On Mon, 18 Aug 2014, John Morris wrote:
rule by_bank {
         ruleset 3
         type replicated
         min_size 3
         max_size 4
         step take default
         step choose firstn 0 type bank
         step choose firstn 0 type osd
         step emit
}

You probably want:

          step choose firstn 0 type bank
          step choose firstn 1 type osd

I.e., 3 (or 4) banks, and 1 OSD in each, not 3 banks with 3 OSDs in each
or 4 banks with 4 OSDs in each (for a total of 9 or 16 OSDs).
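
For reference, with Sage's fix folded in, the full rule reads:

```
rule by_bank {
        ruleset 3
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 0 type bank
        step choose firstn 1 type osd
        step emit
}
```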

Yes, thanks.  Funny, testing still works with the incorrect version, and
the --show-utilization test results look similar.

Following up on my last email about tunables: those can also be expressed
in the human-readable map like so:

tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50

Wrapping up this exercise:

This little script helps show exactly where replicas land, and what goes wrong with my original, incorrect map.

#!/bin/bash
echo "compiling crush map"
crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin \
    --enable-unsafe-tunables
bad="$(crushtool -i /tmp/crush-new.bin --test \
        --show-bad-mappings 2>&1 | \
    wc -l)"
echo "number of bad mappings:  $bad"

distribution() {
    crushtool -i /tmp/crush-new.bin --test --show-statistics \
        --num-rep "$1" 2>&1 | \
        awk '/\[.*\]/ {
            # strip brackets and sort the OSD ids (asort is gawk-only)
            gsub("[][]","",$6);
            split($6,a,",");
            asort(a,d);
            print d[1], d[2], d[3], d[4]; }' | \
        sort | uniq -c
}
for i in 3 4; do
    echo "distribution of size=${i} replicas:"
    distribution "$i"
done
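
For anyone without gawk's asort(), the same tallying the awk pipeline does can be sketched in Python. The sample lines below are made-up stand-ins; real input would be the `crushtool --test --show-statistics` output, which I'm assuming carries each mapping as a bracketed list like [0,1,2,3]:

```python
from collections import Counter
import re

def tally(lines):
    """Count how often each (sorted) OSD set appears in
    crushtool --test --show-statistics output lines."""
    counts = Counter()
    for line in lines:
        # assumed format: the mapping appears as a bracketed
        # comma-separated list somewhere on the line
        m = re.search(r'\[([\d,]+)\]', line)
        if m:
            osds = tuple(sorted(int(x) for x in m.group(1).split(',')))
            counts[osds] += 1
    return counts

# hypothetical sample lines, standing in for real crushtool output
sample = [
    "CRUSH rule 3 x 0 [0,1,2,3]",
    "CRUSH rule 3 x 1 [2,3,0,1]",
    "CRUSH rule 3 x 2 [4,5,6,7]",
]
print(tally(sample))
```

Sorting the ids before counting is what lets [0,1,2,3] and [2,3,0,1] collapse into one bucket, just as the awk/sort/uniq pipeline does.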


For --num-rep=4, the result looks like the following; it's easy to see that two OSDs from the same bank are always picked together, exactly what we do NOT want (note OSDs 0+1 in bank0, 2+3 in bank1, etc.):

    173 0 1 2 3
    176 0 1 4 5
    184 0 1 6 7
    171 2 3 4 5
    156 2 3 6 7
    164 4 5 6 7
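
A quick sanity check on that table makes the failure mode explicit. Assuming the banks are laid out as bank0={0,1}, bank1={2,3}, bank2={4,5}, bank3={6,7} (my reading of the map), every one of the six mappings is exactly two whole banks:

```python
# the six mappings from the --num-rep=4 test above (counts omitted)
bad = [(0, 1, 2, 3), (0, 1, 4, 5), (0, 1, 6, 7),
       (2, 3, 4, 5), (2, 3, 6, 7), (4, 5, 6, 7)]

# assumed bank layout: two OSDs per bank
banks = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]

for m in bad:
    picked = set(m)
    whole = [b for b in banks if b <= picked]
    # each mapping is two complete banks, so losing one bank
    # takes out half the replicas at once
    print(m, "-> whole banks picked:", len(whole))
```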

After Sage's correction, the result looks like the following, with one OSD from each bank:

     70 0 2 4 6
     74 0 2 4 7
     65 0 2 5 6
     58 0 2 5 7
     60 0 3 4 6
     72 0 3 4 7
     80 0 3 5 6
     64 0 3 5 7
     48 1 2 4 6
     66 1 2 4 7
     72 1 2 5 6
     46 1 2 5 7
     73 1 3 4 6
     70 1 3 4 7
     51 1 3 5 6
     55 1 3 5 7

When replicas=3, the result is also correct.

So this is a bit of a hack, but it does seem to distribute 3-4 replicas evenly across a bucket level with only two nodes. Late in this exploration, it also appears that if the 'bank' layer is undesirable, the following works to distribute evenly across hosts:

        step choose firstn 0 type host
        step choose firstn 2 type osd

In conclusion, this example doesn't seem so far-fetched: it's easy to imagine wanting to distribute OSDs across two racks, PDUs, or data centers, where it's not unreasonable for a third to be out of the budget.

        John
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com