Thanks so much. My `ceph osd df tree` output is here: https://gist.github.com/hnuzhoulin/e83140168eb403f4712273e3bb925a1c

Just as that output shows, and following up on David's reply: when I out osd.132, its PGs remap only within its host cld-osd12-56-sata, so it looks as if `out` does not change the host's weight. But if I out osd.10, which is in a replicated pool, its PGs remap across its media bucket site1-rack1-ssd rather than staying on its host cld-osd1-56-ssd, which looks as if `out` does change the host's weight there. So does the `out` command behave differently for firstn and indep rules, am I right? If it does, we need to reserve more free space on each disk in the indep (EC) pool.

When I instead run `ceph osd crush reweight osd.132 0.0`, its PGs remap across its media bucket, just as a plain `out` does in the firstn pool, so in that case the host's weight does change. PG diffs: https://gist.github.com/hnuzhoulin/aab164975b4e3d31bbecbc5c8b2f1fef

From that diff output I can see the difference between the strategies for selecting OSDs in a CRUSH hierarchy, but I still cannot see why the `out` command acts differently for firstn and indep.
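A minimal sketch of how the two commands differ on the bucket weights, using osd.132 and cld-osd12-56-sata from the gist above (run against a test cluster, or revert afterwards):

    ceph osd df tree | grep cld-osd12-56-sata    # host CRUSH weight is 10.000
    ceph osd out 132                             # only the REWEIGHT column of osd.132 drops to 0
    ceph osd df tree | grep cld-osd12-56-sata    # host CRUSH weight is still 10.000
    ceph osd crush reweight osd.132 0.0          # the item's CRUSH weight itself is zeroed
    ceph osd df tree | grep cld-osd12-56-sata    # host CRUSH weight is now 9.000
    ceph osd crush reweight osd.132 1.0          # revert if this was only an experiment
    ceph osd in 132

`out` only zeroes the reweight column in both pool types; the host bucket's CRUSH weight changes only when the item's CRUSH weight is changed.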
David Turner <[email protected]> wrote on Saturday, February 16, 2019 at 1:22 AM:
>
> I'm leaving the response on the CRUSH rule for Gregory, but you have another
> problem you're running into that is causing more of this data to stay on this
> node than you intend. While you `out` the OSD it is still contributing to
> the Host's weight. So the host is still set to receive that amount of data
> and distribute it among the disks inside of it. This is the default behavior
> (even if you `destroy` the OSD) to minimize the data movement for losing the
> disk and again for adding it back into the cluster after you replace the
> device. If you are really strapped for space, though, then you might
> consider fully purging the OSD, which will reduce the Host weight to what the
> other OSDs are. However, if you do have a problem in your CRUSH rule, then
> doing this won't change anything for you.
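On a jewel (10.2.x) cluster there is, as far as I know, no single `ceph osd purge` command, so the "fully purging" mentioned above would roughly be the classic manual removal sequence, sketched here for osd.132 (check cluster health between steps):

    /etc/init.d/ceph stop osd.132      # same init style as used later in this thread
    ceph osd out 132
    ceph osd crush remove osd.132      # this is the step that actually lowers the host bucket weight
    ceph auth del osd.132
    ceph osd rm 132

Only the `crush remove` step changes the host's CRUSH weight; everything before it leaves the bucket weights untouched.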
>
> On Thu, Feb 14, 2019 at 11:15 PM hnuzhoulin2 <[email protected]> wrote:
>>
>> Thanks. I read your reply in
>> https://www.mail-archive.com/[email protected]/msg48717.html
>> so using indep remaps less data when an OSD fails:
>>
>> using firstn: 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6 (60% of the data remapped)
>> using indep:  1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5 (25% of the data remapped)
>>
>> Am I right?
>> If so, what is recommended when a disk has failed and the total free space on the
>> remaining disks in that machine is not enough (the failed disk cannot be replaced
>> immediately)? Or should I simply reserve more free space in the EC case?
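On the "reserve more space" question, a rough back-of-the-envelope check, assuming the remapped data really does stay inside the host and spreads evenly over the remaining OSDs (10 SATA OSDs per host, and `osd backfill full ratio = 0.92` as in the ceph.conf quoted below):

    # usage grows by a factor of n/(n-1) when one of n equally weighted OSDs
    # in a host is drained onto the others
    awk 'BEGIN { n = 10; backfill_full = 0.92;
                 printf "keep per-OSD usage below %.0f%%\n",
                        100 * backfill_full * (n - 1) / n }'
    # -> keep per-OSD usage below 83%

Two failed disks in the same host would need correspondingly more headroom.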
>>
>> On 02/14/2019 02:49, Gregory Farnum <[email protected]> wrote:
>>
>> Your CRUSH rule for EC pools is forcing that behavior with the line
>>
>>     step chooseleaf indep 1 type ctnr
>>
>> If you want different behavior, you'll need a different CRUSH rule.
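For what it's worth, one hypothetical variant of that rule, if two shards per rack are acceptable, is to pick fewer racks and two hosts inside each, giving CRUSH other hosts in the same rack to fall back to. This is only a sketch against the map quoted further down, not a recommendation (the rule name and ruleset number are made up): it gives up the current one-shard-per-rack placement, and changing the rule of an existing pool remaps a large amount of data. A crushtool workflow for trying such a change offline is sketched at the end of the thread, after the full map.

    rule site1_sata_erasure_2per_rack {
        ruleset 2
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take site1-sata
        step choose indep 3 type media        # three racks ...
        step chooseleaf indep 2 type ctnr     # ... two hosts (one OSD each) per rack = 6 shards
        step emit
    }

Note that under this layout a whole-rack failure costs two shards at once, which with k=4, m=2 leaves no redundancy margin until recovery completes (and may block IO depending on the pool's min_size).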
>>
>> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2 <[email protected]> wrote:
>>>
>>> Hi, cephers
>>>
>>> I am building a Ceph EC cluster. When a disk fails I mark it out, but all of its
>>> PGs remap to OSDs in the same host, while I think they should remap to other hosts
>>> in the same rack.
>>>
>>> The test process is:
>>>
>>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2 site1_sata_erasure_ruleset 400000000
>>> ceph osd df tree | awk '{print $1" "$2" "$3" "$9" "$10}' > /tmp/1
>>> /etc/init.d/ceph stop osd.2
>>> ceph osd out 2
>>> ceph osd df tree | awk '{print $1" "$2" "$3" "$9" "$10}' > /tmp/2
>>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>>
>>> 0 1.00000 1.00000 118 osd.0          | 0 1.00000 1.00000 126 osd.0
>>> 1 1.00000 1.00000 123 osd.1          | 1 1.00000 1.00000 139 osd.1
>>> 2 1.00000 1.00000 122 osd.2          | 2 1.00000 0       0   osd.2
>>> 3 1.00000 1.00000 113 osd.3          | 3 1.00000 1.00000 131 osd.3
>>> 4 1.00000 1.00000 122 osd.4          | 4 1.00000 1.00000 136 osd.4
>>> 5 1.00000 1.00000 112 osd.5          | 5 1.00000 1.00000 127 osd.5
>>> 6 1.00000 1.00000 114 osd.6          | 6 1.00000 1.00000 128 osd.6
>>> 7 1.00000 1.00000 124 osd.7          | 7 1.00000 1.00000 136 osd.7
>>> 8 1.00000 1.00000 95  osd.8          | 8 1.00000 1.00000 113 osd.8
>>> 9 1.00000 1.00000 112 osd.9          | 9 1.00000 1.00000 119 osd.9
>>> TOTAL 3073T 197G                     | TOTAL 3065T 197G
>>> MIN/MAX VAR: 0.84/26.56              | MIN/MAX VAR: 0.84/26.52
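To see exactly which PGs moved, not just the per-OSD PG counts, the same experiment can be wrapped in a PG dump before and after (a sketch; let peering settle before the second dump):

    ceph pg dump pgs_brief > /tmp/pg.before
    ceph osd out 2
    # once `ceph status` is stable again:
    ceph pg dump pgs_brief > /tmp/pg.after
    diff /tmp/pg.before /tmp/pg.after | grep '^[<>]'    # only PGs whose up/acting sets changed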
>>>
>>> Some config info (full configs: https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125):
>>> jewel 10.2.11, filestore + rocksdb
>>>
>>> ceph osd erasure-code-profile get ISA-4-2
>>> k=4
>>> m=2
>>> plugin=isa
>>> ruleset-failure-domain=ctnr
>>> ruleset-root=site1-sata
>>> technique=reed_sol_van
>>>
>>> Part of ceph.conf is:
>>>
>>> [global]
>>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>>> auth cluster required = cephx
>>> auth service required = cephx
>>> auth client required = cephx
>>> pid file = /home/ceph/var/run/$name.pid
>>> log file = /home/ceph/log/$cluster-$name.log
>>> mon osd nearfull ratio = 0.85
>>> mon osd full ratio = 0.95
>>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>>> osd pool default size = 3
>>> osd pool default min size = 1
>>> osd objectstore = filestore
>>> filestore merge threshold = -10
>>>
>>> [mon]
>>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>>> mon data = /home/ceph/var/lib/$type/$cluster-$id
>>> mon cluster log file = /home/ceph/log/$cluster.log
>>>
>>> [osd]
>>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>>> osd data = /home/ceph/var/lib/$type/$cluster-$id
>>> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
>>> osd journal size = 10000
>>> osd mkfs type = xfs
>>> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
>>> osd backfill full ratio = 0.92
>>> osd failsafe full ratio = 0.95
>>> osd failsafe nearfull ratio = 0.85
>>> osd max backfills = 1
>>> osd crush update on start = false
>>> osd op thread timeout = 60
>>> filestore split multiple = 8
>>> filestore max sync interval = 15
>>> filestore min sync interval = 5
>>>
>>> [osd.0]
>>> host = cld-osd1-56
>>> addr = XXXXX
>>> user = ceph
>>> devs = /disk/link/osd-0/data
>>> osd journal = /disk/link/osd-0/journal
>>> ...
>>> [osd.503]
>>> host = cld-osd42-56
>>> addr = 10.108.87.52
>>> user = ceph
>>> devs = /disk/link/osd-503/data
>>> osd journal = /disk/link/osd-503/journal
>>>
>>> The crushmap is below:
>>>
>>> # begin crush map
>>> tunable choose_local_tries 0
>>> tunable choose_local_fallback_tries 0
>>> tunable choose_total_tries 50
>>> tunable chooseleaf_descend_once 1
>>> tunable chooseleaf_vary_r 1
>>> tunable straw_calc_version 1
>>> tunable allowed_bucket_algs 54
>>>
>>> # devices
>>> device 0 osd.0
>>> device 1 osd.1
>>> device 2 osd.2
>>> ...
>>> device 502 osd.502
>>> device 503 osd.503
>>>
>>> # types
>>> type 0 osd          # osd
>>> type 1 ctnr         # sata/ssd group by node, -101~1xx/-201~2xx
>>> type 2 media        # sata/ssd group by rack, -11~1x/-21~2x
>>> type 3 mediagroup   # sata/ssd group by site, -5/-6
>>> type 4 unit         # site, -2
>>> type 5 root         # root, -1
>>>
>>> # buckets
>>> ctnr cld-osd1-56-sata {
>>>     id -101        # do not change unnecessarily
>>>     # weight 10.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item osd.0 weight 1.000
>>>     item osd.1 weight 1.000
>>>     item osd.2 weight 1.000
>>>     item osd.3 weight 1.000
>>>     item osd.4 weight 1.000
>>>     item osd.5 weight 1.000
>>>     item osd.6 weight 1.000
>>>     item osd.7 weight 1.000
>>>     item osd.8 weight 1.000
>>>     item osd.9 weight 1.000
>>> }
>>> ctnr cld-osd1-56-ssd {
>>>     id -201        # do not change unnecessarily
>>>     # weight 2.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item osd.10 weight 1.000
>>>     item osd.11 weight 1.000
>>> }
>>> ...
>>> ctnr cld-osd41-56-sata {
>>>     id -141        # do not change unnecessarily
>>>     # weight 10.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item osd.480 weight 1.000
>>>     item osd.481 weight 1.000
>>>     item osd.482 weight 1.000
>>>     item osd.483 weight 1.000
>>>     item osd.484 weight 1.000
>>>     item osd.485 weight 1.000
>>>     item osd.486 weight 1.000
>>>     item osd.487 weight 1.000
>>>     item osd.488 weight 1.000
>>>     item osd.489 weight 1.000
>>> }
>>> ctnr cld-osd41-56-ssd {
>>>     id -241        # do not change unnecessarily
>>>     # weight 2.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item osd.490 weight 1.000
>>>     item osd.491 weight 1.000
>>> }
>>> ctnr cld-osd42-56-sata {
>>>     id -142        # do not change unnecessarily
>>>     # weight 10.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item osd.492 weight 1.000
>>>     item osd.493 weight 1.000
>>>     item osd.494 weight 1.000
>>>     item osd.495 weight 1.000
>>>     item osd.496 weight 1.000
>>>     item osd.497 weight 1.000
>>>     item osd.498 weight 1.000
>>>     item osd.499 weight 1.000
>>>     item osd.500 weight 1.000
>>>     item osd.501 weight 1.000
>>> }
>>>
>>> media site1-rack1-sata {
>>>     id -11         # do not change unnecessarily
>>>     # weight 70.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd1-56-sata weight 10.000
>>>     item cld-osd2-56-sata weight 10.000
>>>     item cld-osd3-56-sata weight 10.000
>>>     item cld-osd4-56-sata weight 10.000
>>>     item cld-osd5-56-sata weight 10.000
>>>     item cld-osd6-56-sata weight 10.000
>>>     item cld-osd7-56-sata weight 10.000
>>> }
>>> media site1-rack2-sata {
>>>     id -12         # do not change unnecessarily
>>>     # weight 70.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd8-56-sata weight 10.000
>>>     item cld-osd9-56-sata weight 10.000
>>>     item cld-osd10-56-sata weight 10.000
>>>     item cld-osd11-56-sata weight 10.000
>>>     item cld-osd12-56-sata weight 10.000
>>>     item cld-osd13-56-sata weight 10.000
>>>     item cld-osd14-56-sata weight 10.000
>>> }
>>> media site1-rack3-sata {
>>>     id -13         # do not change unnecessarily
>>>     # weight 70.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd15-56-sata weight 10.000
>>>     item cld-osd16-56-sata weight 10.000
>>>     item cld-osd17-56-sata weight 10.000
>>>     item cld-osd18-56-sata weight 10.000
>>>     item cld-osd19-56-sata weight 10.000
>>>     item cld-osd20-56-sata weight 10.000
>>>     item cld-osd21-56-sata weight 10.000
>>> }
>>> media site1-rack4-sata {
>>>     id -14         # do not change unnecessarily
>>>     # weight 70.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd22-56-sata weight 10.000
>>>     item cld-osd23-56-sata weight 10.000
>>>     item cld-osd24-56-sata weight 10.000
>>>     item cld-osd25-56-sata weight 10.000
>>>     item cld-osd26-56-sata weight 10.000
>>>     item cld-osd27-56-sata weight 10.000
>>>     item cld-osd28-56-sata weight 10.000
>>> }
>>> media site1-rack5-sata {
>>>     id -15         # do not change unnecessarily
>>>     # weight 70.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd29-56-sata weight 10.000
>>>     item cld-osd30-56-sata weight 10.000
>>>     item cld-osd31-56-sata weight 10.000
>>>     item cld-osd32-56-sata weight 10.000
>>>     item cld-osd33-56-sata weight 10.000
>>>     item cld-osd34-56-sata weight 10.000
>>>     item cld-osd35-56-sata weight 10.000
>>> }
>>> media site1-rack6-sata {
>>>     id -16         # do not change unnecessarily
>>>     # weight 70.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd36-56-sata weight 10.000
>>>     item cld-osd37-56-sata weight 10.000
>>>     item cld-osd38-56-sata weight 10.000
>>>     item cld-osd39-56-sata weight 10.000
>>>     item cld-osd40-56-sata weight 10.000
>>>     item cld-osd41-56-sata weight 10.000
>>>     item cld-osd42-56-sata weight 10.000
>>> }
>>>
>>> media site1-rack1-ssd {
>>>     id -21         # do not change unnecessarily
>>>     # weight 14.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd1-56-ssd weight 2.000
>>>     item cld-osd2-56-ssd weight 2.000
>>>     item cld-osd3-56-ssd weight 2.000
>>>     item cld-osd4-56-ssd weight 2.000
>>>     item cld-osd5-56-ssd weight 2.000
>>>     item cld-osd6-56-ssd weight 2.000
>>>     item cld-osd7-56-ssd weight 2.000
>>>     item cld-osd8-56-ssd weight 2.000
>>>     item cld-osd9-56-ssd weight 2.000
>>>     item cld-osd10-56-ssd weight 2.000
>>>     item cld-osd11-56-ssd weight 2.000
>>>     item cld-osd12-56-ssd weight 2.000
>>>     item cld-osd13-56-ssd weight 2.000
>>>     item cld-osd14-56-ssd weight 2.000
>>> }
>>> media site1-rack2-ssd {
>>>     id -22         # do not change unnecessarily
>>>     # weight 14.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd15-56-ssd weight 2.000
>>>     item cld-osd16-56-ssd weight 2.000
>>>     item cld-osd17-56-ssd weight 2.000
>>>     item cld-osd18-56-ssd weight 2.000
>>>     item cld-osd19-56-ssd weight 2.000
>>>     item cld-osd20-56-ssd weight 2.000
>>>     item cld-osd21-56-ssd weight 2.000
>>>     item cld-osd22-56-ssd weight 2.000
>>>     item cld-osd23-56-ssd weight 2.000
>>>     item cld-osd24-56-ssd weight 2.000
>>>     item cld-osd25-56-ssd weight 2.000
>>>     item cld-osd26-56-ssd weight 2.000
>>>     item cld-osd27-56-ssd weight 2.000
>>>     item cld-osd28-56-ssd weight 2.000
>>> }
>>> media site1-rack3-ssd {
>>>     id -23         # do not change unnecessarily
>>>     # weight 14.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item cld-osd29-56-ssd weight 2.000
>>>     item cld-osd30-56-ssd weight 2.000
>>>     item cld-osd31-56-ssd weight 2.000
>>>     item cld-osd32-56-ssd weight 2.000
>>>     item cld-osd33-56-ssd weight 2.000
>>>     item cld-osd34-56-ssd weight 2.000
>>>     item cld-osd35-56-ssd weight 2.000
>>>     item cld-osd36-56-ssd weight 2.000
>>>     item cld-osd37-56-ssd weight 2.000
>>>     item cld-osd38-56-ssd weight 2.000
>>>     item cld-osd39-56-ssd weight 2.000
>>>     item cld-osd40-56-ssd weight 2.000
>>>     item cld-osd41-56-ssd weight 2.000
>>>     item cld-osd42-56-ssd weight 2.000
>>> }
>>>
>>> mediagroup site1-sata {
>>>     id -5          # do not change unnecessarily
>>>     # weight 420.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item site1-rack1-sata weight 70.000
>>>     item site1-rack2-sata weight 70.000
>>>     item site1-rack3-sata weight 70.000
>>>     item site1-rack4-sata weight 70.000
>>>     item site1-rack5-sata weight 70.000
>>>     item site1-rack6-sata weight 70.000
>>> }
>>> mediagroup site1-ssd {
>>>     id -6          # do not change unnecessarily
>>>     # weight 84.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item site1-rack1-ssd weight 28.000
>>>     item site1-rack2-ssd weight 28.000
>>>     item site1-rack3-ssd weight 28.000
>>> }
>>>
>>> unit site1 {
>>>     id -2          # do not change unnecessarily
>>>     # weight 504.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item site1-sata weight 420.000
>>>     item site1-ssd weight 84.000
>>> }
>>>
>>> root default {
>>>     id -1          # do not change unnecessarily
>>>     # weight 504.000
>>>     alg straw2
>>>     hash 0         # rjenkins1
>>>     item site1 weight 504.000
>>> }
>>>
>>> # rules
>>> rule site1_sata_erasure_ruleset {
>>>     ruleset 0
>>>     type erasure
>>>     min_size 3
>>>     max_size 6
>>>     step set_chooseleaf_tries 5
>>>     step set_choose_tries 100
>>>     step take site1-sata
>>>     step choose indep 0 type media
>>>     step chooseleaf indep 1 type ctnr
>>>     step emit
>>> }
>>> rule site1_ssd_replicated_ruleset {
>>>     ruleset 1
>>>     type replicated
>>>     min_size 1
>>>     max_size 10
>>>     step take site1-ssd
>>>     step choose firstn 0 type media
>>>     step chooseleaf firstn 1 type ctnr
>>>     step emit
>>> }
>>> # end crush map
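Before applying any change to the map above (including the two-hosts-per-rack rule sketched earlier), the mapping behaviour can be checked offline with crushtool; a rough workflow, with file names chosen only for illustration:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt (add or change a rule), then recompile:
    crushtool -c crushmap.txt -o crushmap.new
    # show which OSDs the EC rule (rule 0) picks for a few sample inputs, 6 shards each:
    crushtool -i crushmap.new --test --rule 0 --num-rep 6 --min-x 0 --max-x 9 --show-mappings
    # repeat with osd.132 weighted out to see where its shards would move:
    crushtool -i crushmap.new --test --rule 0 --num-rep 6 --min-x 0 --max-x 9 --show-mappings --weight 132 0

Diffing the two mapping outputs shows the rule's remap behaviour without moving any data in the cluster.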
