Re: [ceph-users] Understanding incomplete PGs
Hi The "ec unable to recover when below min size" thing has very recently been fixed for octopus. See https://tracker.ceph.com/issues/18749 and https://github.com/ceph/ceph/pull/17619 Docs has been updated with a section on this issue http://docs.ceph.com/docs/master/rados/operations/erasure-code/#erasure-coded-pool-recover [2] /Torben On 05.07.2019 11:50, Paul Emmerich wrote: > * There are virtually no use cases for ec pools with m=1, this is a bad > configuration as you can't have both availability and durability > > * Due to weird internal restrictions ec pools below their min size can't > recover, you'll probably have to reduce min_size temporarily to recover it > > * Depending on your version it might be necessary to restart some of the OSDs > due to a bug (fixed by now) that caused it to mark some objects as degraded > if you remove or restart an OSD while you have remapped objects > * run "ceph osd safe-to-destroy X" to check if it's safe to destroy a given > OSD > > -- > Paul Emmerich > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > croit GmbH > Freseniusstr. 31h > 81247 München > www.croit.io [1] > Tel: +49 89 1896585 90 > > On Fri, Jul 5, 2019 at 1:17 AM Kyle wrote: > >> Hello, >> >> I'm working with a small ceph cluster (about 10TB, 7-9 OSDs, all Bluestore >> on >> lvm) and recently ran into a problem with 17 pgs marked as incomplete after >> adding/removing OSDs. >> >> Here's the sequence of events: >> 1. 7 osds in the cluster, health is OK, all pgs are active+clean >> 2. 3 new osds on a new host are added, lots of backfilling in progress >> 3. osd 6 needs to be removed, so we do "ceph osd crush reweight osd.6 0" >> 4. after a few hours we see "min osd.6 with 0 pgs" from "ceph osd >> utilization" >> 5. ceph osd out 6 >> 6. systemctl stop ceph-osd@6 >> 7. the drive backing osd 6 is pulled and wiped >> 8. backfilling has now finished all pgs are active+clean except for 17 >> incomplete pgs >> >> From reading the docs, it sounds like there has been unrecoverable data loss >> in those 17 pgs. That raises some questions for me: >> >> Was "ceph osd utilization" only showing a goal of 0 pgs allocated instead of >> the current actual allocation? >> >> Why is there data loss from a single osd being removed? Shouldn't that be >> recoverable? >> All pools in the cluster are either replicated 3 or erasure-coded k=2,m=1 >> with >> default "host" failure domain. They shouldn't suffer data loss with a single >> osd being removed even if there were no reweighting beforehand. Does the >> backfilling temporarily reduce data durability in some way? >> >> Is there a way to see which pgs actually have data on a given osd? >> >> I attached an example of one of the incomplete pgs. >> >> Thanks for any help, >> >> Kyle___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Links: -- [1] http://www.croit.io [2] http://docs.ceph.com/docs/master/rados/operations/erasure-code/#erasure-coded-pool-recovery___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Understanding incomplete PGs
On Friday, July 5, 2019 11:50:44 AM CDT Paul Emmerich wrote:
> * There are virtually no use cases for EC pools with m=1; this is a bad
> configuration, as you can't have both availability and durability.

I'll have to look into this more. The cluster only has 4 hosts, so it might
be worth switching to an "osd" failure domain for the EC pools and using
k=5,m=2.

> * Due to weird internal restrictions, EC pools below their min_size can't
> recover; you'll probably have to reduce min_size temporarily to recover it.

Lowering min_size to 2 did allow it to recover.

> * Depending on your version, it might be necessary to restart some of the
> OSDs, due to a (now fixed) bug that caused some objects to be marked as
> degraded if you remove or restart an OSD while you have remapped objects.
>
> * Run "ceph osd safe-to-destroy X" to check whether it's safe to destroy a
> given OSD.

Excellent, thanks!

> > [...]
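If anyone wants to experiment with that layout, a minimal sketch (profile
name, pool name, and PG counts are placeholders; a pool's erasure-code
profile is fixed at creation, so this means creating a new pool and
migrating the data into it):

    # Profile with 5 data + 2 coding shards, shards placed per OSD
    ceph osd erasure-code-profile set ec52-osd k=5 m=2 crush-failure-domain=osd

    # New pool using that profile (PG counts are placeholders)
    ceph osd pool create ecpool-52 64 64 erasure ec52-osd

Keep in mind that with crush-failure-domain=osd, several of a PG's 7 shards
can land on the same host, so a single host failure can make PGs
unavailable even though any two OSD failures are survivable.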
Re: [ceph-users] Understanding incomplete PGs
On Friday, July 5, 2019 11:28:32 AM CDT Caspar Smit wrote:
> Kyle,
>
> Was the cluster still backfilling when you removed osd 6, or did you only
> check its utilization?

Yes, still backfilling.

> Running an EC pool with m=1 is a bad idea. EC pool min_size = k+1, so
> losing a single OSD results in inaccessible data.
> Your incomplete PGs are probably all EC pool PGs, please verify.

Yes, also correct.

> If the above statement is true, you could *temporarily* set min_size to 2
> (on your EC pools) to get back access to your data, but this is a very
> dangerous action. Losing another OSD during this period results in actual
> data loss.

This resolved the issue. I had seen reducing min_size mentioned elsewhere,
but for some reason I thought that applied only to replicated pools. Thank
you!

> Kind regards,
> Caspar Smit
>
> On Fri, Jul 5, 2019 at 1:17 AM Kyle wrote:
> > [...]
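For the archives, the workaround as applied here amounts to something like
this on a k=2,m=1 pool ("my-ec-pool" is a placeholder; min_size should go
back to k+1 = 3 as soon as recovery finishes):

    # DANGEROUS: PGs go active with only k shards, i.e. zero redundancy
    ceph osd pool set my-ec-pool min_size 2

    # ...wait for recovery/backfill to complete, then restore the default
    ceph osd pool set my-ec-pool min_size 3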
Re: [ceph-users] Understanding incomplete PGs
* There are virtually no use cases for EC pools with m=1; this is a bad
configuration, as you can't have both availability and durability.

* Due to weird internal restrictions, EC pools below their min_size can't
recover; you'll probably have to reduce min_size temporarily to recover it.

* Depending on your version, it might be necessary to restart some of the
OSDs, due to a (now fixed) bug that caused some objects to be marked as
degraded if you remove or restart an OSD while you have remapped objects.

* Run "ceph osd safe-to-destroy X" to check whether it's safe to destroy a
given OSD.

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Jul 5, 2019 at 1:17 AM Kyle wrote:
> Hello,
>
> I'm working with a small Ceph cluster (about 10 TB, 7-9 OSDs, all
> BlueStore on LVM) and recently ran into a problem with 17 PGs marked as
> incomplete after adding/removing OSDs.
>
> Here's the sequence of events:
> 1. 7 OSDs in the cluster, health is OK, all PGs are active+clean.
> 2. 3 new OSDs on a new host are added; lots of backfilling in progress.
> 3. osd.6 needs to be removed, so we run "ceph osd crush reweight osd.6 0".
> 4. After a few hours we see "min osd.6 with 0 pgs" from "ceph osd
>    utilization".
> 5. ceph osd out 6
> 6. systemctl stop ceph-osd@6
> 7. The drive backing osd.6 is pulled and wiped.
> 8. Backfilling has now finished; all PGs are active+clean except for the
>    17 incomplete PGs.
>
> From reading the docs, it sounds like there has been unrecoverable data
> loss in those 17 PGs. That raises some questions for me:
>
> Was "ceph osd utilization" only showing a goal of 0 PGs allocated instead
> of the current actual allocation?
>
> Why is there data loss from a single OSD being removed? Shouldn't that be
> recoverable? All pools in the cluster are either replicated 3 or
> erasure-coded k=2,m=1 with the default "host" failure domain. They
> shouldn't suffer data loss from a single OSD being removed, even with no
> reweighting beforehand. Does the backfilling temporarily reduce data
> durability in some way?
>
> Is there a way to see which PGs actually have data on a given OSD?
>
> I attached an example of one of the incomplete PGs.
>
> Thanks for any help,
>
> Kyle
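Putting those bullets together with the sequence of events below, a sketch
of a drain-and-remove procedure for osd.6 using the safe-to-destroy check
(Luminous or later assumed; adapt the OSD id to your setup):

    ceph osd crush reweight osd.6 0          # start draining the OSD
    # wait for backfill to finish and the cluster to return to HEALTH_OK

    ceph osd safe-to-destroy 6               # reports whether removal would leave data degraded
    ceph osd out 6
    systemctl stop ceph-osd@6
    ceph osd purge 6 --yes-i-really-mean-it  # remove from CRUSH map, auth, and osdmap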
Re: [ceph-users] Understanding incomplete PGs
Kyle,

Was the cluster still backfilling when you removed osd 6, or did you only
check its utilization?

Running an EC pool with m=1 is a bad idea. EC pool min_size = k+1, so
losing a single OSD results in inaccessible data.
Your incomplete PGs are probably all EC pool PGs, please verify.

If the above statement is true, you could *temporarily* set min_size to 2
(on your EC pools) to get back access to your data, but this is a very
dangerous action. Losing another OSD during this period results in actual
data loss.

Kind regards,
Caspar Smit

On Fri, Jul 5, 2019 at 1:17 AM, Kyle wrote:
> [...]
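On the "please verify" point above, and on the question about which PGs
have data on a given OSD, a couple of commands that may help (osd.6 stands
in for whichever OSD is of interest):

    # List PGs in the "incomplete" state; the number before the dot
    # in each PG id is the pool id
    ceph pg ls incomplete

    # Map pool ids back to pool names and types (replicated vs erasure)
    ceph osd pool ls detail

    # List the PGs currently mapped to a given OSD
    ceph pg ls-by-osd osd.6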