Hi Robert,

Just wanted to let you know that after applying your CRUSH suggestion and allowing the cluster to rebalance itself, I now have symmetrical data distribution. (For the archives, I've pasted the full rule as applied at the bottom of this message.)

My rationale for keeping 5 monitors is availability. I have 3 compute nodes + 2 storage nodes, and I was thinking that making all of them monitors would provide additional redundancy. Based on your earlier comments, can you provide guidance on how much latency is induced by having excess monitors deployed?

Thanks.
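P.S. Sanity-checking the quorum math from the docs you linked, if I have it right: a quorum needs floor(n/2) + 1 monitors, so an even monitor count tolerates no more failures than the odd count below it:

    mons   quorum   mon failures tolerated
     3       2               1
     4       3               1
     5       3               2
     6       4               2

So 5 monitors across my 5 nodes should tolerate 2 monitor failures, and a 6th would have bought nothing.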
> On Jan 18, 2016, at 12:36, Robert LeBlanc <[email protected]> wrote:
>
> Not that I know of.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Mon, Jan 18, 2016 at 10:33 AM, deeepdish wrote:
>> Thanks Robert. Will definitely try this. Is there a way to implement
>> "gradual CRUSH" changes? I've noticed that whenever cluster-wide changes
>> are pushed (a new CRUSH map, for instance), the cluster immediately
>> attempts to align itself, disrupting client access / performance...
>>
>>> On Jan 18, 2016, at 12:22, Robert LeBlanc wrote:
>>>
>>> I'm not sure why you have six monitors. Six monitors buy you nothing
>>> over five other than more power used, more latency, and more headache.
>>> See
>>> http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum
>>> for some more info. Also, I'd consider 5 monitors overkill for a
>>> cluster this size; I'd recommend three.
>>>
>>> Although this is most likely not the root cause of your problem, you
>>> probably have an error here: "root replicated-T1" points to b02s08
>>> and b02s12, and "site erbus" also points to b02s08 and b02s12. You
>>> probably meant to have "root replicated-T1" point to erbus instead.
>>>
>>> Where I think your problem is, is in your "rule replicated" section.
>>> You can try:
>>>
>>>     step take replicated-T1
>>>     step choose firstn 2 type host
>>>     step chooseleaf firstn 2 type osdgroup
>>>     step emit
>>>
>>> What this does is choose two hosts from the root replicated-T1 (which
>>> happens to be both hosts you have), then choose an OSD from two
>>> osdgroups on each host.
>>>
>>> I believe the problem with your current rule is that "firstn 0 type
>>> host" tries to select four hosts, but only two are available. You
>>> should be able to see that with 'ceph pg dump', where only two OSDs
>>> will be listed in the up set.
>>>
>>> I hope that helps.
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>
>>> On Sun, Jan 17, 2016 at 6:31 PM, deeepdish wrote:
>>>> Hi Everyone,
>>>>
>>>> Looking for a double-check of my logic and CRUSH map.
>>>>
>>>> Overview:
>>>>
>>>> - The osdgroup bucket type defines a failure domain within a host:
>>>>   5 OSDs + 1 SSD.
>>>>   Therefore 5 OSDs (all utilizing the same journal)
>>>>   constitute an osdgroup bucket. Each host has 4 osdgroups.
>>>> - 6 monitors
>>>> - Two-node cluster
>>>> - Each node:
>>>>   - 20 OSDs
>>>>   - 4 SSDs
>>>>   - 4 osdgroups
>>>>
>>>> Desired CRUSH rule outcome:
>>>> - Assuming a pool with min_size=2 and size=4, each node would
>>>>   contain a redundant copy of each object. Should either host fail,
>>>>   access to data would be uninterrupted.
>>>>
>>>> Current CRUSH rule outcome:
>>>> - There are 4 copies of each object; however, I don't believe each
>>>>   node has a redundant copy of each object. When a node fails, data
>>>>   is NOT accessible until ceph rebuilds itself / the node becomes
>>>>   accessible again.
>>>>
>>>> I suspect my CRUSH map is not right, and remedying it may take some
>>>> time and make the cluster unresponsive / unavailable. Is there a
>>>> way / method to apply substantial CRUSH changes gradually to a
>>>> cluster?
>>>>
>>>> Thanks for your help.
>>>>
>>>> Current crush map:
>>>>
>>>> # begin crush map
>>>> tunable choose_local_tries 0
>>>> tunable choose_local_fallback_tries 0
>>>> tunable choose_total_tries 50
>>>> tunable chooseleaf_descend_once 1
>>>> tunable straw_calc_version 1
>>>>
>>>> # devices
>>>> device 0 osd.0
>>>> device 1 osd.1
>>>> device 2 osd.2
>>>> device 3 osd.3
>>>> device 4 osd.4
>>>> device 5 osd.5
>>>> device 6 osd.6
>>>> device 7 osd.7
>>>> device 8 osd.8
>>>> device 9 osd.9
>>>> device 10 osd.10
>>>> device 11 osd.11
>>>> device 12 osd.12
>>>> device 13 osd.13
>>>> device 14 osd.14
>>>> device 15 osd.15
>>>> device 16 osd.16
>>>> device 17 osd.17
>>>> device 18 osd.18
>>>> device 19 osd.19
>>>> device 20 osd.20
>>>> device 21 osd.21
>>>> device 22 osd.22
>>>> device 23 osd.23
>>>> device 24 osd.24
>>>> device 25 osd.25
>>>> device 26 osd.26
>>>> device 27 osd.27
>>>> device 28 osd.28
>>>> device 29 osd.29
>>>> device 30 osd.30
>>>> device 31 osd.31
>>>> device 32 osd.32
>>>> device 33 osd.33
>>>> device 34 osd.34
>>>> device 35 osd.35
>>>> device 36 osd.36
>>>> device 37 osd.37
>>>> device 38 osd.38
>>>> device 39 osd.39
>>>>
>>>> # types
>>>> type 0 osd
>>>> type 1 osdgroup
>>>> type 2 host
>>>> type 3 rack
>>>> type 4 site
>>>> type 5 root
>>>>
>>>> # buckets
>>>> osdgroup b02s08-osdgroupA {
>>>>         id -81          # do not change unnecessarily
>>>>         # weight 18.100
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item osd.0 weight 3.620
>>>>         item osd.1 weight 3.620
>>>>         item osd.2 weight 3.620
>>>>         item osd.3 weight 3.620
>>>>         item osd.4 weight 3.620
>>>> }
>>>> osdgroup b02s08-osdgroupB {
>>>>         id -82          # do not change unnecessarily
>>>>         # weight 18.100
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item osd.5 weight 3.620
>>>>         item osd.6 weight 3.620
>>>>         item osd.7 weight 3.620
>>>>         item osd.8 weight 3.620
>>>>         item osd.9 weight 3.620
>>>> }
>>>> osdgroup b02s08-osdgroupC {
>>>>         id -83          # do not change unnecessarily
>>>>         # weight 19.920
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item osd.10 weight 3.620
>>>>         item osd.11 weight 3.620
>>>>         item osd.12 weight 3.620
>>>>         item osd.13 weight 3.620
>>>>         item osd.14 weight 5.440
>>>> }
>>>> osdgroup b02s08-osdgroupD {
>>>>         id -84          # do not change unnecessarily
>>>>         # weight 19.920
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item osd.15 weight 3.620
>>>>         item osd.16 weight 3.620
>>>>         item osd.17 weight 3.620
>>>>         item osd.18 weight 3.620
>>>>         item osd.19 weight 5.440
>>>> }
>>>> host b02s08 {
>>>>         id -80          # do not change unnecessarily
>>>>         # weight 76.040
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item b02s08-osdgroupA weight 18.100
>>>>         item b02s08-osdgroupB weight 18.100
>>>>         item b02s08-osdgroupC weight 19.920
>>>>         item b02s08-osdgroupD weight 19.920
>>>> }
>>>> osdgroup b02s12-osdgroupA {
>>>>         id -121         # do not change unnecessarily
>>>>         # weight 18.100
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item osd.20 weight 3.620
>>>>         item osd.21 weight 3.620
>>>>         item osd.22 weight 3.620
>>>>         item osd.23 weight 3.620
>>>>         item osd.24 weight 3.620
>>>> }
>>>> osdgroup b02s12-osdgroupB {
>>>>         id -122         # do not change unnecessarily
>>>>         # weight 18.100
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item osd.25 weight 3.620
>>>>         item osd.26 weight 3.620
>>>>         item osd.27 weight 3.620
>>>>         item osd.28 weight 3.620
>>>>         item osd.29 weight 3.620
>>>> }
>>>> osdgroup b02s12-osdgroupC {
>>>>         id -123         # do not change unnecessarily
>>>>         # weight 19.920
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item osd.30 weight 3.620
>>>>         item osd.31 weight 3.620
>>>>         item osd.32 weight 3.620
>>>>         item osd.33 weight 3.620
>>>>         item osd.34 weight 5.440
>>>> }
>>>> osdgroup b02s12-osdgroupD {
>>>>         id -124         # do not change unnecessarily
>>>>         # weight 19.920
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item osd.35 weight 3.620
>>>>         item osd.36 weight 3.620
>>>>         item osd.37 weight 3.620
>>>>         item osd.38 weight 3.620
>>>>         item osd.39 weight 5.440
>>>> }
>>>> host b02s12 {
>>>>         id -120         # do not change unnecessarily
>>>>         # weight 76.040
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item b02s12-osdgroupA weight 18.100
>>>>         item b02s12-osdgroupB weight 18.100
>>>>         item b02s12-osdgroupC weight 19.920
>>>>         item b02s12-osdgroupD weight 19.920
>>>> }
>>>> root replicated-T1 {
>>>>         id -1           # do not change unnecessarily
>>>>         # weight 152.080
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item b02s08 weight 76.040
>>>>         item b02s12 weight 76.040
>>>> }
>>>> rack b02 {
>>>>         id -20          # do not change unnecessarily
>>>>         # weight 152.080
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item b02s08 weight 76.040
>>>>         item b02s12 weight 76.040
>>>> }
>>>> site erbus {
>>>>         id -10          # do not change unnecessarily
>>>>         # weight 152.080
>>>>         alg straw
>>>>         hash 0  # rjenkins1
>>>>         item b02 weight 152.080
>>>> }
>>>>
>>>> # rules
>>>> rule replicated {
>>>>         ruleset 0
>>>>         type replicated
>>>>         min_size 1
>>>>         max_size 10
>>>>         step take replicated-T1
>>>>         step choose firstn 0 type host
>>>>         step chooseleaf firstn 0 type osdgroup
>>>>         step emit
>>>> }
>>>>
>>>> # end crush map
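P.P.S. For the archives, here is the full rule as I've now applied it: your suggested steps dropped into my existing "rule replicated" skeleton (ruleset number and min_size/max_size copied unchanged from my map above):

rule replicated {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        # take both hosts under root replicated-T1 ...
        step take replicated-T1
        step choose firstn 2 type host
        # ... then two osdgroups per host, one OSD from each,
        # so size=4 lands as 2 copies per host
        step chooseleaf firstn 2 type osdgroup
        step emit
}

If I've understood your explanation correctly, 'ceph pg dump' should now show four OSDs in each PG's up set, two per host.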
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
