Hi,
Thanks to your answers I now understand this part of Ceph better. I made the
change to the crushmap that Maxime suggested, and the results are now what I
expected from the beginning:
# ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
0 7.27100 1.00000 7445G 1830G 5614G 24.59 0.98 238
3 7.27100 1.00000 7445G 1700G 5744G 22.84 0.91 229
4 7.27100 1.00000 7445G 1731G 5713G 23.26 0.93 233
1 1.81299 1.00000 1856G 661G 1195G 35.63 1.43 87
5 1.81299 1.00000 1856G 544G 1311G 29.34 1.17 73
6 1.81299 1.00000 1856G 519G 1337G 27.98 1.12 71
2 2.72198 1.00000 2787G 766G 2021G 27.50 1.10 116
7 2.72198 1.00000 2787G 651G 2136G 23.36 0.93 103
8 2.72198 1.00000 2787G 661G 2126G 23.72 0.95 98
TOTAL 36267G 9067G 27200G 25.00
MIN/MAX VAR: 0.91/1.43 STDDEV: 4.20
#
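For reference, the change can also be verified with the rule dump (assuming the
default jewel rule name, "replicated_ruleset"):

# ceph osd crush rule dump replicated_ruleset

The chooseleaf step in its output should now report "type": "osd" instead of
"type": "host".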
I understand that the Ceph default "type host" is safer than "type osd",
but as I said before, this cluster is for testing purposes only.
Thanks for all your answers :)
2017-06-06 9:20 GMT+02:00 Maxime Guyot <[email protected]>:
> Hi Félix,
>
> Changing the failure domain to OSD is probably the easiest option if this
> is a test cluster. I think the commands would go like this (the resulting
> rule is sketched below):
> - ceph osd getcrushmap -o map.bin
> - crushtool -d map.bin -o map.txt
> - sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0 type osd/' map.txt
> - crushtool -c map.txt -o map.bin
> - ceph osd setcrushmap -i map.bin
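>
> For reference, after the sed edit the relevant rule in map.txt should look
> roughly like this (a sketch; the rule name, ruleset number and min/max_size
> may differ on your cluster):
>
>   rule replicated_ruleset {
>           ruleset 0
>           type replicated
>           min_size 1
>           max_size 10
>           step take default
>           step chooseleaf firstn 0 type osd
>           step emit
>   }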
>
> Moving HDDs around so that each server holds ~8TB would be a good option if
> this is a capacity-focused use case. It will allow you to reboot 1 server at
> a time without radosgw downtime. You would target 26/3 = 8.66TB per node, so:
> - node1: 1x8TB
> - node2: 1x8TB + 1x2TB
> - node3: 2x6TB + 1x2TB
>
> If you are more concerned about performance, then set the weights to 1 on
> all HDDs and forget about the wasted capacity.
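>
> A minimal sketch of that, using the osd ids from the ceph osd df output above
> (ceph osd crush reweight changes the crush weight, so data will rebalance):
>
>   for i in 0 1 2 3 4 5 6 7 8; do ceph osd crush reweight osd.$i 1.0; done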
>
> Cheers,
> Maxime
>
>
> On Tue, 6 Jun 2017 at 00:44 Christian Wuerdig <[email protected]>
> wrote:
>
>> Yet another option is to change the failure domain to OSD instead of host
>> (this avoids having to move disks around and will probably meet your initial
>> expectations).
>> It does mean your cluster will become unavailable when you lose a host, until
>> you fix it. OTOH you probably don't have too much leeway anyway with
>> just 3 hosts, so it might be an acceptable trade-off. It also means you can
>> just add new OSDs to the servers wherever they fit.
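>>
>> Whether I/O actually blocks when a host goes down then depends on how many
>> replicas of a PG ended up on that host and on the pool's min_size, which can
>> be checked with e.g.:
>>
>>   ceph osd pool get default.rgw.buckets.data size
>>   ceph osd pool get default.rgw.buckets.data min_size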
>>
>> On Tue, Jun 6, 2017 at 1:51 AM, David Turner <[email protected]>
>> wrote:
>>
>>> If you want to resolve your issue without purchasing another node, you
>>> should move one disk of each size into each server. This process will be
>>> quite painful, as you'll need to actually move the disks in the crush map to
>>> be under a different host and all of your data will move around, but
>>> afterwards CRUSH will be able to utilize the weights and distribute the
>>> data between the 2TB, 3TB, and 8TB drives much more evenly.
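>>>
>>> A sketch of the crush-map side of such a move (hypothetical example:
>>> re-homing osd.1 under node01 with its current weight; physically moving the
>>> disk is a separate step):
>>>
>>>   ceph osd crush set osd.1 1.81310 root=default host=node01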
>>>
>>> On Mon, Jun 5, 2017 at 9:21 AM Loic Dachary <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On 06/05/2017 02:48 PM, Christian Balzer wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > On Mon, 5 Jun 2017 13:54:02 +0200 Félix Barbeira wrote:
>>>> >
>>>> >> Hi,
>>>> >>
>>>> >> We have a small cluster for radosgw use only. It has three nodes,
>>>> >> with 3
>>>> > ^^^^^ ^^^^^
>>>> >> osds each. Each node has different disk sizes:
>>>> >>
>>>> >
>>>> > There's your answer, staring you right in the face.
>>>> >
>>>> > Your default replication size is 3, your default failure domain is
>>>> > host.
>>>> >
>>>> > Ceph cannot distribute data purely according to the weights, since each
>>>> > replica needs to be on a different node (one replica per node) to comply
>>>> > with the replica size.
>>>>
>>>> Another way to look at it is to imagine a situation where 10TB worth of
>>>> data is stored on node01, which has 3x8TB = 24TB. Since you asked for 3
>>>> replicas, this data must also be replicated to node02 but ... there is only
>>>> 3x2TB = 6TB available there. So the maximum you can store is 6TB, and the
>>>> remaining disk space on node01 and node03 will never be used.
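>>>>
>>>> With the sizes reported by ceph osd df that works out to roughly node01
>>>> ~21.8TB, node02 ~5.4TB and node03 ~8.2TB. With 3 replicas and a host
>>>> failure domain, every object needs one copy on each of the 3 hosts, so the
>>>> usable capacity is capped by the smallest host: about 5.4TB of data
>>>> (3 x 5.4TB ~= 16.3TB raw), however large node01 is.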
>>>>
>>>> python-crush analyze will display a message about that situation and show
>>>> which buckets are overweighted.
>>>>
>>>> Cheers
>>>>
>>>> >
>>>> > If your cluster had 4 or more nodes, you'd see what you expected.
>>>> > And you most likely wouldn't be happy about the performance, with your 8TB
>>>> > HDDs seeing 4 times more I/Os than the 2TB ones and thus becoming the
>>>> > bottleneck of your cluster.
>>>> >
>>>> > Christian
>>>> >
>>>> >> node01 : 3x8TB
>>>> >> node02 : 3x2TB
>>>> >> node03 : 3x3TB
>>>> >>
>>>> >> I thought that the weight handles the amount of data that every osd
>>>> >> receives. In this case, for example, the node with the 8TB disks should
>>>> >> receive more than the rest, right? Instead, all of them receive the same
>>>> >> amount of data and the smaller disks (2TB) reach 100% before the bigger
>>>> >> ones. Am I doing something wrong?
>>>> >>
>>>> >> The cluster is jewel LTS 10.2.7.
>>>> >>
>>>> >> # ceph osd df
>>>> >> ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
>>>> >> 0 7.27060 1.00000 7445G 1012G 6432G 13.60 0.57 133
>>>> >> 3 7.27060 1.00000 7445G 1081G 6363G 14.52 0.61 163
>>>> >> 4 7.27060 1.00000 7445G 787G 6657G 10.58 0.44 120
>>>> >> 1 1.81310 1.00000 1856G 1047G 809G 56.41 2.37 143
>>>> >> 5 1.81310 1.00000 1856G 956G 899G 51.53 2.16 143
>>>> >> 6 1.81310 1.00000 1856G 877G 979G 47.24 1.98 130
>>>> >> 2 2.72229 1.00000 2787G 1010G 1776G 36.25 1.52 140
>>>> >> 7 2.72229 1.00000 2787G 831G 1955G 29.83 1.25 130
>>>> >> 8 2.72229 1.00000 2787G 1038G 1748G 37.27 1.56 146
>>>> >> TOTAL 36267G 8643G 27624G 23.83
>>>> >> MIN/MAX VAR: 0.44/2.37 STDDEV: 18.60
>>>> >> #
>>>> >>
>>>> >> # ceph osd tree
>>>> >> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>> >> -1 35.41795 root default
>>>> >> -2 21.81180 host node01
>>>> >> 0 7.27060 osd.0 up 1.00000 1.00000
>>>> >> 3 7.27060 osd.3 up 1.00000 1.00000
>>>> >> 4 7.27060 osd.4 up 1.00000 1.00000
>>>> >> -3 5.43929 host node02
>>>> >> 1 1.81310 osd.1 up 1.00000 1.00000
>>>> >> 5 1.81310 osd.5 up 1.00000 1.00000
>>>> >> 6 1.81310 osd.6 up 1.00000 1.00000
>>>> >> -4 8.16687 host node03
>>>> >> 2 2.72229 osd.2 up 1.00000 1.00000
>>>> >> 7 2.72229 osd.7 up 1.00000 1.00000
>>>> >> 8 2.72229 osd.8 up 1.00000 1.00000
>>>> >> #
>>>> >>
>>>> >> # ceph -s
>>>> >> cluster 49ba9695-7199-4c21-9199-ac321e60065e
>>>> >> health HEALTH_OK
>>>> >> monmap e1: 3 mons at {ceph-mon01=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon02=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon03=[x:x:x:x:x:x:x:x]:6789/0}
>>>> >> election epoch 48, quorum 0,1,2 ceph-mon01,ceph-mon03,ceph-mon02
>>>> >> osdmap e265: 9 osds: 9 up, 9 in
>>>> >> flags sortbitwise,require_jewel_osds
>>>> >> pgmap v95701: 416 pgs, 11 pools, 2879 GB data, 729 kobjects
>>>> >> 8643 GB used, 27624 GB / 36267 GB avail
>>>> >> 416 active+clean
>>>> >> #
>>>> >>
>>>> >> # ceph osd pool ls
>>>> >> .rgw.root
>>>> >> default.rgw.control
>>>> >> default.rgw.data.root
>>>> >> default.rgw.gc
>>>> >> default.rgw.log
>>>> >> default.rgw.users.uid
>>>> >> default.rgw.users.keys
>>>> >> default.rgw.buckets.index
>>>> >> default.rgw.buckets.non-ec
>>>> >> default.rgw.buckets.data
>>>> >> default.rgw.users.email
>>>> >> #
>>>> >>
>>>> >> # ceph df
>>>> >> GLOBAL:
>>>> >>     SIZE       AVAIL      RAW USED     %RAW USED
>>>> >>     36267G     27624G     8643G        23.83
>>>> >> POOLS:
>>>> >>     NAME                           ID     USED      %USED     MAX AVAIL     OBJECTS
>>>> >>     .rgw.root                      1      1588      0         5269G         4
>>>> >>     default.rgw.control            2      0         0         5269G         8
>>>> >>     default.rgw.data.root          3      8761      0         5269G         28
>>>> >>     default.rgw.gc                 4      0         0         5269G         32
>>>> >>     default.rgw.log                5      0         0         5269G         127
>>>> >>     default.rgw.users.uid          6      4887      0         5269G         28
>>>> >>     default.rgw.users.keys         7      144       0         5269G         16
>>>> >>     default.rgw.buckets.index      9      0         0         5269G         14
>>>> >>     default.rgw.buckets.non-ec     10     0         0         5269G         3
>>>> >>     default.rgw.buckets.data       11     2879G     35.34     5269G         746848
>>>> >>     default.rgw.users.email        12     13        0         5269G         1
>>>> >> #
>>>> >>
>>>> >
>>>> >
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>>
>>>
>>
>
--
Félix Barbeira.
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com