Re: [ceph-users] Degraded objects after: ceph osd in $osd
I reply to myself.

> I've added a new node, slowly added 4 new OSDs, but in the meantime an
> OSD (not one of the new ones, not on the node to be removed) died. My
> situation now is:
>
> root@blackpanther:~# ceph osd df tree
> ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  TYPE NAME
> -1 21.41985        -  5586G  2511G  3074G     0    0 root default
> -2  5.45996        -  5586G  2371G  3214G 42.45 0.93     host capitanamerica
>  0  1.81999      1.0  1862G   739G  1122G 39.70 0.87         osd.0
>  1  1.81999      1.0  1862G   856G  1005G 46.00 1.00         osd.1
> 10  0.90999      1.0   931G   381G   549G 40.95 0.89         osd.10
> 11  0.90999      1.0   931G   394G   536G 42.35 0.92         osd.11
> -3  5.03996        -  5586G  2615G  2970G 46.82 1.02     host vedovanera
>  2  1.3          1.0  1862G   684G  1177G 36.78 0.80         osd.2
>  3  1.81999      1.0  1862G  1081G   780G 58.08 1.27         osd.3
>  4  0.90999      1.0   931G   412G   518G 44.34 0.97         osd.4
>  5  0.90999      1.0   931G   436G   494G 46.86 1.02         osd.5
> -4  5.45996        -   931G   583G   347G     0    0     host deadpool
>  6  1.81999      1.0  1862G   898G   963G 48.26 1.05         osd.6
>  7  1.81999      1.0  1862G   839G  1022G 45.07 0.98         osd.7
>  8  0.90999        0      0      0      0     0    0         osd.8
>  9  0.90999      1.0   931G   583G   347G 62.64 1.37         osd.9
> -5  5.45996        -  5586G  2511G  3074G 44.96 0.98     host blackpanther
> 12  1.81999      1.0  1862G   828G  1033G 44.51 0.97         osd.12
> 13  1.81999      1.0  1862G   753G  1108G 40.47 0.88         osd.13
> 14  0.90999      1.0   931G   382G   548G 41.11 0.90         osd.14
> 15  0.90999      1.0   931G   546G   384G 58.66 1.28         osd.15
>                TOTAL 21413G  9819G 11594G 45.85
> MIN/MAX VAR: 0/1.37  STDDEV: 7.37
>
> Perfectly healthy. But I've tried to, slowly, remove an OSD from
> 'vedovanera', and so I've tried with:
>
>   ceph osd crush reweight osd.2
>
> As you can see, I've arrived at weight 1.4 (from 1.81999), but if I go
> lower than that I get:
>
>   [...]
>   recovery 2/2556513 objects degraded (0.000%)

It seems the trouble came from osd.8, which was out and down but had not
been removed from the crushmap (it still had weight 0.90999). After
removing osd.8, a massive rebalance started. After that, I can now lower
the weight of the OSDs on node 'vedovanera' and I get no more degraded
objects.
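The stepwise drain described above can be sketched as a small command generator. The OSD id, start weight, target, and step size below are illustrative placeholders (Marco's real start weight was 1.81999); the function only prints the `ceph osd crush reweight` commands it would run, it does not talk to a cluster.

```shell
# Print the sequence of reweight commands for draining an OSD in fixed
# decrements. In a real run you would execute each command and wait for
# recovery to settle (poll 'ceph health') before the next step.
gen_reweight_steps() {
  osd="$1"; w="$2"; target="$3"; step="$4"
  while [ "$(awk "BEGIN{if ($w > $target) print 1; else print 0}")" -eq 1 ]; do
    w=$(awk "BEGIN{printf \"%.1f\", $w - $step}")
    echo "ceph osd crush reweight $osd $w"
  done
}

# Illustrative values: drain osd.2 from 1.8 down to 1.4 in 0.1 steps.
gen_reweight_steps osd.2 1.8 1.4 0.1
```

This prints four reweight commands (1.7, 1.6, 1.5, 1.4), one per recovery round.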
I think I'm starting to understand how the CRUSH algorithm concretely works. ;-)

-- 
dott. Marco Gaiarin                        GNUPG Key ID: 240A3D66
Associazione ``La Nostra Famiglia''        http://www.lanostrafamiglia.it/
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

  Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
  (tax code 00307430132, category ONLUS or RICERCA SANITARIA)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Degraded objects after: ceph osd in $osd
On Mon, Nov 26, 2018 at 3:30 AM Janne Johansson wrote:
> On Sun 25 Nov 2018 at 22:10, Stefan Kooman wrote:
> >
> > Hi List,
> >
> > Another interesting and unexpected thing we observed during cluster
> > expansion is the following. After we added extra disks to the cluster,
> > while the "norebalance" flag was set, we put the new OSDs "IN". As soon
> > as we did that, a couple of hundred objects would become degraded.
> > During that time no OSD crashed or restarted. Every "ceph osd crush add
> > $osd weight host=$storage-node" would cause extra degraded objects.
> >
> > I don't expect objects to become degraded when extra OSDs are added.
> > Misplaced, yes. Degraded, no.
> >
> > Someone got an explanation for this?
>
> Yes, when you add a drive (or 10), some PGs decide they should have one
> or more replicas on the new drives, a new empty PG is created there, and
> _then_ that replica will make that PG get into the "degraded" mode,
> meaning if it had 3 fine active+clean replicas before, it now has 2
> active+clean and one needing backfill to get into shape.
>
> It is a slight mistake in reporting it in the same way as an error, even
> if it looks to the cluster just as if it was in error and needs fixing.
> This gives new ceph admins a sense of urgency or danger, whereas it
> should be perfectly normal to add space to a cluster. Also, it could
> have chosen to add a fourth copy in a repl=3 PG and fill from the one
> going out into the new empty PG, and somehow keep itself with 3 working
> replicas, but ceph chooses to first discard one replica, then backfill
> into the empty one, leading to this kind of "error" report.

See, that's the thing: Ceph is designed *not* to reduce data reliability
this way; it shouldn't do that, and as far as I've been able to establish,
it doesn't actually do that. Which makes these degraded object reports a
bit perplexing.
What we have worked out is that objects can sometimes be reported degraded
because log-based recovery takes a while after the primary juggles the PG
set membership around, and I suspect that's what is turning up here. The
exact cause still eludes me a bit, but I assume it's a consequence of the
backfill and recovery throttling we've added over the years. If a whole PG
were missing, you'd expect to see very large degraded object counts (as
opposed to the 2 that Marco reported).
-Greg

> --
> May the most significant bit of your life be positive.
Re: [ceph-users] Degraded objects after: ceph osd in $osd
Mandi! Janne Johansson, on that day you wrote...

> It is a slight mistake in reporting it in the same way as an error, even
> if it looks to the cluster just as if it was in error and needs fixing.

I think I've hit a similar situation, and I also feel that something has
to be 'fixed'. I'm seeking an explanation...

I'm adding a node (blackpanther, 4 OSDs, done) and removing a node
(vedovanera[1], 4 OSDs, to be done). I've added the new node, slowly added
4 new OSDs, but in the meantime an OSD (not one of the new ones, not on
the node to be removed) died. My situation now is:

root@blackpanther:~# ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  TYPE NAME
-1 21.41985        -  5586G  2511G  3074G     0    0 root default
-2  5.45996        -  5586G  2371G  3214G 42.45 0.93     host capitanamerica
 0  1.81999      1.0  1862G   739G  1122G 39.70 0.87         osd.0
 1  1.81999      1.0  1862G   856G  1005G 46.00 1.00         osd.1
10  0.90999      1.0   931G   381G   549G 40.95 0.89         osd.10
11  0.90999      1.0   931G   394G   536G 42.35 0.92         osd.11
-3  5.03996        -  5586G  2615G  2970G 46.82 1.02     host vedovanera
 2  1.3          1.0  1862G   684G  1177G 36.78 0.80         osd.2
 3  1.81999      1.0  1862G  1081G   780G 58.08 1.27         osd.3
 4  0.90999      1.0   931G   412G   518G 44.34 0.97         osd.4
 5  0.90999      1.0   931G   436G   494G 46.86 1.02         osd.5
-4  5.45996        -   931G   583G   347G     0    0     host deadpool
 6  1.81999      1.0  1862G   898G   963G 48.26 1.05         osd.6
 7  1.81999      1.0  1862G   839G  1022G 45.07 0.98         osd.7
 8  0.90999        0      0      0      0     0    0         osd.8
 9  0.90999      1.0   931G   583G   347G 62.64 1.37         osd.9
-5  5.45996        -  5586G  2511G  3074G 44.96 0.98     host blackpanther
12  1.81999      1.0  1862G   828G  1033G 44.51 0.97         osd.12
13  1.81999      1.0  1862G   753G  1108G 40.47 0.88         osd.13
14  0.90999      1.0   931G   382G   548G 41.11 0.90         osd.14
15  0.90999      1.0   931G   546G   384G 58.66 1.28         osd.15
               TOTAL 21413G  9819G 11594G 45.85
MIN/MAX VAR: 0/1.37  STDDEV: 7.37

Perfectly healthy.
But I've tried to, slowly, remove an OSD from 'vedovanera', and so I've
tried with:

  ceph osd crush reweight osd.2

As you can see, I've arrived at weight 1.4 (from 1.81999), but if I go
lower than that I get:

    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            6 pgs backfill
            1 pgs backfilling
            7 pgs stuck unclean
            recovery 2/2556513 objects degraded (0.000%)
            recovery 7721/2556513 objects misplaced (0.302%)
     monmap e6: 6 mons at {0=10.27.251.7:6789/0,1=10.27.251.8:6789/0,2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0}
            election epoch 2780, quorum 0,1,2,3,4,5 blackpanther,0,1,4,2,3
     osdmap e9302: 16 osds: 15 up, 15 in; 7 remapped pgs
      pgmap v54971897: 768 pgs, 3 pools, 3300 GB data, 830 kobjects
            9911 GB used, 11502 GB / 21413 GB avail
            2/2556513 objects degraded (0.000%)
            7721/2556513 objects misplaced (0.302%)
                 761 active+clean
                   6 active+remapped+wait_backfill
                   1 active+remapped+backfilling
  client io 9725 kB/s rd, 772 kB/s wr, 153 op/s

E.g., 2 objects 'degraded'. This really puzzles me. Why?!

Thanks.

[1] Some Marvel Comics heroes got translated into Italian, so
'vedovanera' is 'Black Widow' and 'capitanamerica' clearly 'Captain
America'.
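As an aside, the percentages in the status block above can be reproduced directly from the raw object counts it reports (2/2556513 degraded, 7721/2556513 misplaced). A minimal sketch:

```shell
# Format an object-count ratio the way ceph's status output does
# (three decimal places).
pct() {
  awk "BEGIN{printf \"%.3f\", ($1 / $2) * 100}"
}

echo "degraded:  $(pct 2 2556513)%"      # degraded:  0.000%
echo "misplaced: $(pct 7721 2556513)%"   # misplaced: 0.302%
```

So "0.000%" is just rounding: two degraded objects out of 2.5 million vanish at three decimal places, which is why the raw counts are the number to watch.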
Re: [ceph-users] Degraded objects after: ceph osd in $osd
On Mon 26 Nov 2018 at 09:39, Stefan Kooman wrote:
> > It is a slight mistake in reporting it in the same way as an error,
> > even if it looks to the cluster just as if it was in error and needs
> > fixing. This gives new ceph admins a sense of urgency or danger,
> > whereas it should be perfectly normal to add space to a cluster.
> > Also, it could have chosen to add a fourth copy in a repl=3 PG and
> > fill from the one going out into the new empty PG, and somehow keep
> > itself with 3 working replicas, but ceph chooses to first discard one
> > replica, then backfill into the empty one, leading to this kind of
> > "error" report.
>
> Thanks for the explanation. I agree with you that it would be safer to
> first backfill to the new PG instead of just assuming the new OSD will
> be fine and discarding a perfectly healthy PG. We do have max_size 3 in
> the CRUSH ruleset ... I wonder if Ceph would behave differently if we
> had max_size 4 ... to actually allow a fourth copy in the first place ...

I don't think the replication number is important. It's more of a choice
which perhaps is meant to let you move PGs to a new drive even when the
cluster is nearly full: killing off one unneeded replica and writing
straight to the new drive clears out space a lot faster, whereas keeping
all the old replicas until the data is 100% ok on the new replica would
mean the new space doesn't appear until a large amount of data has moved,
which for large drives and large PGs might take a very long time.

-- 
May the most significant bit of your life be positive.
Re: [ceph-users] Degraded objects after: ceph osd in $osd
Quoting Janne Johansson (icepic...@gmail.com):
> Yes, when you add a drive (or 10), some PGs decide they should have one
> or more replicas on the new drives, a new empty PG is created there, and
> _then_ that replica will make that PG get into the "degraded" mode,
> meaning if it had 3 fine active+clean replicas before, it now has 2
> active+clean and one needing backfill to get into shape.
>
> It is a slight mistake in reporting it in the same way as an error, even
> if it looks to the cluster just as if it was in error and needs fixing.
> This gives new ceph admins a sense of urgency or danger, whereas it
> should be perfectly normal to add space to a cluster. Also, it could
> have chosen to add a fourth copy in a repl=3 PG and fill from the one
> going out into the new empty PG, and somehow keep itself with 3 working
> replicas, but ceph chooses to first discard one replica, then backfill
> into the empty one, leading to this kind of "error" report.

Thanks for the explanation. I agree with you that it would be safer to
first backfill to the new PG instead of just assuming the new OSD will be
fine and discarding a perfectly healthy PG. We do have max_size 3 in the
CRUSH ruleset ... I wonder if Ceph would behave differently if we had
max_size 4 ... to actually allow a fourth copy in the first place ...

Gr. Stefan

-- 
| BIT BV   http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
Re: [ceph-users] Degraded objects after: ceph osd in $osd
On Sun 25 Nov 2018 at 22:10, Stefan Kooman wrote:
>
> Hi List,
>
> Another interesting and unexpected thing we observed during cluster
> expansion is the following. After we added extra disks to the cluster,
> while the "norebalance" flag was set, we put the new OSDs "IN". As soon
> as we did that, a couple of hundred objects would become degraded.
> During that time no OSD crashed or restarted. Every "ceph osd crush add
> $osd weight host=$storage-node" would cause extra degraded objects.
>
> I don't expect objects to become degraded when extra OSDs are added.
> Misplaced, yes. Degraded, no.
>
> Someone got an explanation for this?

Yes, when you add a drive (or 10), some PGs decide they should have one or
more replicas on the new drives, a new empty PG is created there, and
_then_ that replica will make that PG get into the "degraded" mode,
meaning if it had 3 fine active+clean replicas before, it now has 2
active+clean and one needing backfill to get into shape.

It is a slight mistake in reporting it in the same way as an error, even
if it looks to the cluster just as if it was in error and needs fixing.
This gives new ceph admins a sense of urgency or danger, whereas it should
be perfectly normal to add space to a cluster. Also, it could have chosen
to add a fourth copy in a repl=3 PG and fill from the one going out into
the new empty PG, and somehow keep itself with 3 working replicas, but
ceph chooses to first discard one replica, then backfill into the empty
one, leading to this kind of "error" report.

-- 
May the most significant bit of your life be positive.
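Janne's replica-counting description can be summarized as a toy classifier. This is an illustrative simplification of the reporting distinction discussed in this thread, not Ceph's actual code; the replica counts passed in are hypothetical inputs.

```shell
# Classify a PG the way the thread describes the health report:
#   size     - pool replica count (repl=3 in the discussion)
#   complete - replicas that actually hold the data
#   wrong    - replicas that hold the data but sit on the wrong OSD
classify_pg() {
  size="$1"; complete="$2"; wrong="$3"
  if [ "$complete" -lt "$size" ]; then
    echo "degraded"        # fewer copies of the data exist than requested
  elif [ "$wrong" -gt 0 ]; then
    echo "misplaced"       # all copies exist, some just need to move
  else
    echo "active+clean"
  fi
}

classify_pg 3 3 0   # healthy
classify_pg 3 3 1   # data all present, one copy on the wrong OSD
classify_pg 3 2 0   # one replica discarded, new empty PG backfilling
```

The third call is the expansion case above: discarding a replica before backfilling the new empty PG drops `complete` below `size`, so the PG reports degraded rather than merely misplaced.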
[ceph-users] Degraded objects after: ceph osd in $osd
Hi List,

Another interesting and unexpected thing we observed during cluster
expansion is the following. After we added extra disks to the cluster,
while the "norebalance" flag was set, we put the new OSDs "IN". As soon as
we did that, a couple of hundred objects would become degraded. During
that time no OSD crashed or restarted. Every "ceph osd crush add $osd
weight host=$storage-node" would cause extra degraded objects.

I don't expect objects to become degraded when extra OSDs are added.
Misplaced, yes. Degraded, no.

Someone got an explanation for this?

Gr. Stefan

-- 
| BIT BV   http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
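For reference, the expansion sequence described in this mail can be sketched as a command generator. The `ceph` subcommands are the ones named in the thread; the host name, weight, and OSD ids below are illustrative placeholders, and the function only prints the commands rather than running them against a cluster.

```shell
# Emit the command sequence for adding OSDs to a host while rebalancing
# is suppressed: set norebalance, crush-add each OSD, then unset.
gen_add_cmds() {
  host="$1"; weight="$2"; shift 2
  echo "ceph osd set norebalance"
  for osd in "$@"; do
    echo "ceph osd crush add $osd $weight host=$host"
  done
  echo "ceph osd unset norebalance"
}

# Illustrative expansion: two new OSDs on a hypothetical storage node.
gen_add_cmds storage-node1 1.81999 osd.12 osd.13
```

Note that, per the observation above, each crush-add in the middle of this sequence can still briefly report degraded objects even with norebalance set.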