On 08/18/2014 02:20 PM, John Morris wrote:
On 08/18/2014 01:49 PM, Sage Weil wrote:
On Mon, 18 Aug 2014, John Morris wrote:
rule by_bank {
        ruleset 3
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 0 type bank
        step choose firstn 0 type osd
        step emit
}
You probably want:
        step choose firstn 0 type bank
        step choose firstn 1 type osd
I.e., 3 (or 4) banks, and 1 osd in each, not 3 banks with 3 osds in each
or 4 banks with 4 osds in each (for a total of 9 or 16 OSDs).
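For reference, here is the whole rule with Sage's correction applied:

```
rule by_bank {
        ruleset 3
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 0 type bank
        step choose firstn 1 type osd
        step emit
}
```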
Yes, thanks. Funny, testing still works with the incorrect version, and
the --show-utilization test results look similar.
Regarding my last email about tunables, those can also be expressed in
the human-readable map like so:
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
Wrapping up this exercise:
This little script helps show exactly where replicas go, and what goes
wrong with my original, incorrect map.
#!/bin/bash

echo "compiling crush map"
crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin \
    --enable-unsafe-tunables

# count lines of --show-bad-mappings output; 0 means every input
# mapped to a full-sized set of OSDs
bad="$(crushtool -i /tmp/crush-new.bin --test \
       --show-bad-mappings 2>&1 | wc -l)"
echo "number of bad mappings: $bad"

# print how often each distinct (sorted) set of OSDs is chosen;
# asort() is a gawk extension
distribution() {
    crushtool -i /tmp/crush-new.bin --test --show-statistics \
        --num-rep "$1" 2>&1 | \
    awk '/\[.*\]/ {
        gsub("[][]","",$6);   # strip brackets from the OSD list
        split($6,a,",");
        asort(a,d);
        print d[1], d[2], d[3], d[4]; }' | \
    sort | uniq -c
}

for i in 3 4; do
    echo "distribution of size=${i} replicas:"
    distribution $i
done
For --num-rep=4, the result looks like the following; it's easy to see
that two whole pairs of OSDs from the same bank are always picked, exactly
what we do NOT want (note OSDs 0+1 in bank0, 2+3 in bank1, etc.):
173 0 1 2 3
176 0 1 4 5
184 0 1 6 7
171 2 3 4 5
156 2 3 6 7
164 4 5 6 7
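That pairing can be checked mechanically. A quick shell sketch, assuming
(as in this toy map) that bank N holds OSDs 2N and 2N+1, reduces each set
above to the banks it touches:

```shell
#!/bin/bash
# Reduce each 4-OSD set chosen by the broken rule to the banks it
# touches, assuming bank N holds OSDs 2N and 2N+1 (toy-map layout).
while read a b c d; do
    banks=$(for osd in $a $b $c $d; do echo $((osd / 2)); done | sort -u)
    echo "set $a $b $c $d -> banks:" $banks
done <<'EOF'
0 1 2 3
0 1 4 5
0 1 6 7
2 3 4 5
2 3 6 7
4 5 6 7
EOF
```

Every set maps to just two banks (e.g. "set 0 1 2 3 -> banks: 0 1"),
instead of one OSD from each of the four banks.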
After Sage's correction, the result looks like the following, with one
OSD from each bank:
70 0 2 4 6
74 0 2 4 7
65 0 2 5 6
58 0 2 5 7
60 0 3 4 6
72 0 3 4 7
80 0 3 5 6
64 0 3 5 7
48 1 2 4 6
66 1 2 4 7
72 1 2 5 6
46 1 2 5 7
73 1 3 4 6
70 1 3 4 7
51 1 3 5 6
55 1 3 5 7
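The same kind of quick shell check (again assuming bank N holds OSDs 2N
and 2N+1) confirms that every --num-rep=4 set above touches all four
banks exactly once; a few sample sets:

```shell
#!/bin/bash
# Check that each 4-OSD set from the corrected rule touches all four
# banks, assuming bank N holds OSDs 2N and 2N+1.
while read a b c d; do
    n=$(for osd in $a $b $c $d; do echo $((osd / 2)); done | sort -u | wc -l)
    [ "$n" -eq 4 ] && verdict=ok || verdict=BAD
    echo "set $a $b $c $d: $verdict"
done <<'EOF'
0 2 4 6
0 3 5 7
1 2 5 6
1 3 4 7
EOF
```

All four sample sets print "ok".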
When replicas=3, the result is also correct.
So this is a bit of a hack, but it does seem to work to evenly
distribute 3 or 4 replicas across a bucket level with only two nodes.
Late in this exploration, it also became apparent that if the 'bank'
layer is undesirable, the following distributes evenly across hosts:
step choose firstn 0 type host
step choose firstn 2 type osd
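Spelled out as a complete rule (the name 'by_host' and ruleset number 4
are made up here; the ruleset number just has to be unique in the map),
the host variant would look something like:

```
rule by_host {
        ruleset 4
        type replicated
        min_size 3
        max_size 4
        step take default
        step choose firstn 0 type host
        step choose firstn 2 type osd
        step emit
}
```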
In conclusion, this example doesn't seem so far-fetched: it's easy to
imagine wanting to distribute replicas across two racks, PDUs, or data
centers, where it's not unreasonable to say a third is out of the budget.
John
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com