Re: [ceph-users] unable to repair PG
Just to update this issue. I stopped OSD.6, removed the PG from disk, and restarted it. Ceph rebuilt the object and it went to HEALTH_OK. During the weekend the disk for OSD.6 started giving SMART errors and will be replaced. Thanks for your help Greg. I've opened a bug report in the tracker.
On Fri, Dec 12, 2014 at 9:53 PM, Gregory Farnum wrote: > > [Re-adding the list] > > Yeah, so "shard 6" means that it's osd.6 which has the bad data. > Apparently pg repair doesn't recover from this class of failures; if > you could file a bug that would be appreciated. > But anyway, if you delete the object in question from OSD 6 and run a > repair on the pg again it should recover just fine. > -Greg > > On Fri, Dec 12, 2014 at 1:45 PM, Luis Periquito > wrote: > > Running firefly 0.80.7 with replicated pools, with 4 copies. > > > > On 12 Dec 2014 19:20, "Gregory Farnum" wrote: > >> > >> What version of Ceph are you running? Is this a replicated or > >> erasure-coded pool? > >> > >> On Fri, Dec 12, 2014 at 1:11 AM, Luis Periquito > >> wrote: > >> > Hi Greg, > >> > > >> > thanks for your help. It's always highly appreciated. :) > >> > > >> > On Thu, Dec 11, 2014 at 6:41 PM, Gregory Farnum > >> > wrote: > >> >> > >> >> On Thu, Dec 11, 2014 at 2:57 AM, Luis Periquito > >> >> wrote: > >> >> > Hi, > >> >> > > >> >> > I've stopped OSD.16, removed the PG from the local filesystem and started > >> >> > the OSD again. After ceph rebuilt the PG in the removed OSD I ran a > >> >> > deep-scrub and the PG is still inconsistent. > >> >> > >> >> What led you to remove it from osd 16? Is that the one hosting the log > >> >> you snipped from? Is osd 16 the one hosting shard 6 of that PG, or was > >> >> it the primary? > >> > > >> > OSD 16 is both the primary for this PG and the one that has the snipped log. > >> > The other 3 OSDs don't have any mention of this PG in their logs. Just some > >> > messages about slow requests and the backfill when I removed the object. 
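Greg's point that "shard 6" here names osd.6 itself, not position 6 in a list, is easy to misread, and misreading it is exactly what sent the first repair attempt to osd.16. A tiny sanity check, sketched below with the acting set from the pg dump quoted later in the thread, encodes that reading; this is only an illustrative helper, not a Ceph API:

```python
def osd_holding_bad_shard(shard, acting):
    """For a replicated pool, the 'shard N' in a scrub [ERR] line refers to
    osd.N itself (per Greg's clarification in this thread). Sanity-check
    that this OSD is actually in the PG's acting set before removing the
    object from it and re-running `ceph pg repair`."""
    if shard not in acting:
        raise ValueError("osd.%d is not in acting set %r" % (shard, acting))
    return shard

acting = [16, 10, 27, 6]   # acting set for pg 9.180, from the pg dump in this thread
bad_osd = osd_holding_bad_shard(6, acting)
print("delete the object on osd.%d, then run: ceph pg repair 9.180" % bad_osd)
```

Note that osd.16 (the primary) would also pass this membership check, which is why reading the shard id as an OSD id rather than a list index is the part that matters.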
> >> > Actually it came from OSD.6 - currently we don't have OSD.3. > >> > > >> > this is the output of the pg dump for this PG > >> > 9.180 25614 0 0 0 23306482348 3001 3001 active+clean+inconsistent 2014-12-10 17:29:01.937929 40242'1108124 40242:23305321 [16,10,27,6] 16 [16,10,27,6] 16 40242'1071363 2014-12-10 17:29:01.937881 40242'1071363 2014-12-10 17:29:01.937881 > >> > > >> >> Anyway, the message means that shard 6 (which I think is the seventh > >> >> OSD in the list) of PG 9.180 is missing a bunch of xattrs on object > >> >> 370cbf80/29145.4_xxx/head//9. I'm actually a little surprised it > >> >> didn't crash if it's missing the "_" attr > >> >> -Greg > >> > > >> > Any idea on how to fix it? > >> > > >> >> > >> >> > > >> >> > I'm running out of ideas on trying to solve this. Does this mean that all > >> >> > copies of the object should also be inconsistent? Should I just try to > >> >> > figure which object/bucket this belongs to and delete it/copy it again to > >> >> > the ceph cluster? > >> >> > > >> >> > Also, do you know what the error message means? is it just some sort of > >> >> > metadata for this object that isn't correct, not the object itself? > >> >> > > >> >> > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito > >> >> > wrote: > >> >> >> > >> >> >> Hi, > >> >> >> > >> >> >> In the last few days this PG (pool is .rgw.buckets) has been in error > >> >> >> after running the scrub process. > >> >> >> > >> >> >> After getting the error, and trying to see what may be the issue (and > >> >> >> finding none), I've just issued a ceph repair followed by a ceph > >> >> >> deep-scrub. > >> >> >> However it doesn't seem to have fixed the issue and it still remains. > >> >> >> > >> >> >> The relevant log from the OSD is as follows. 
> >> >> > >> >> 2014-12-10 09:38:09.348110 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects > >> >> 2014-12-10 09:38:09.348116 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 errors > >> >> 2014-12-10 10:13:15.922065 7f8f618be700 0 log [INF] : 9.180 repair ok, 0 fixed > >> >> 2014-12-10 10:55:27.556358 7f8f618be700 0 log [ERR] : 9.180 shard 6: soid 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr _user.rgw.acl, missing attr _user.rgw.content_type, missing attr _user.rgw.etag, missing attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, missing attr snapset > >> >> 2014-12-10 10:56:50.597952 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 > >>
Re: [ceph-users] unable to repair PG
What version of Ceph are you running? Is this a replicated or erasure-coded pool? On Fri, Dec 12, 2014 at 1:11 AM, Luis Periquito wrote: > Hi Greg, > > thanks for your help. It's always highly appreciated. :) > > On Thu, Dec 11, 2014 at 6:41 PM, Gregory Farnum wrote: >> >> On Thu, Dec 11, 2014 at 2:57 AM, Luis Periquito >> wrote: >> > Hi, >> > >> > I've stopped OSD.16, removed the PG from the local filesystem and >> > started >> > the OSD again. After ceph rebuilt the PG in the removed OSD I ran a >> > deep-scrub and the PG is still inconsistent. >> >> What led you to remove it from osd 16? Is that the one hosting the log >> you snipped from? Is osd 16 the one hosting shard 6 of that PG, or was >> it the primary? > > OSD 16 is both the primary for this PG and the one that has the snipped log. > The other 3 OSDs don't have any mention of this PG in their logs. Just some > messages about slow requests and the backfill when I removed the object. > Actually it came from OSD.6 - currently we don't have OSD.3. > > this is the output of the pg dump for this PG > 9.180 25614 0 0 0 23306482348 3001 3001 active+clean+inconsistent 2014-12-10 17:29:01.937929 40242'1108124 40242:23305321 [16,10,27,6] 16 [16,10,27,6] 16 40242'1071363 2014-12-10 17:29:01.937881 40242'1071363 2014-12-10 17:29:01.937881 > >> >> Anyway, the message means that shard 6 (which I think is the seventh >> OSD in the list) of PG 9.180 is missing a bunch of xattrs on object >> 370cbf80/29145.4_xxx/head//9. I'm actually a little surprised it >> didn't crash if it's missing the "_" attr >> -Greg > > > Any idea on how to fix it? > >> >> >> > >> > I'm running out of ideas on trying to solve this. Does this mean that >> > all >> > copies of the object should also be inconsistent? Should I just try to >> > figure which object/bucket this belongs to and delete it/copy it again >> > to >> > the ceph cluster? >> > >> > Also, do you know what the error message means? 
is it just some sort of >> > metadata for this object that isn't correct, not the object itself? >> > >> > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito >> > wrote: >> >> >> >> Hi, >> >> >> >> In the last few days this PG (pool is .rgw.buckets) has been in error >> >> after running the scrub process. >> >> >> >> After getting the error, and trying to see what may be the issue (and >> >> finding none), I've just issued a ceph repair followed by a ceph >> >> deep-scrub. >> >> However it doesn't seem to have fixed the issue and it still remains. >> >> >> >> The relevant log from the OSD is as follows. >> >> >> >> 2014-12-10 09:38:09.348110 7f8f618be700 0 log [ERR] : 9.180 deep-scrub >> >> 0 >> >> missing, 1 inconsistent objects >> >> 2014-12-10 09:38:09.348116 7f8f618be700 0 log [ERR] : 9.180 deep-scrub >> >> 1 >> >> errors >> >> 2014-12-10 10:13:15.922065 7f8f618be700 0 log [INF] : 9.180 repair ok, >> >> 0 >> >> fixed >> >> 2014-12-10 10:55:27.556358 7f8f618be700 0 log [ERR] : 9.180 shard 6: >> >> soid >> >> 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr >> >> _user.rgw.acl, >> >> missing attr _user.rgw.content_type, missing attr _user.rgw.etag, >> >> missing >> >> attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr >> >> _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, >> >> missing >> >> attr snapset >> >> 2014-12-10 10:56:50.597952 7f8f618be700 0 log [ERR] : 9.180 deep-scrub >> >> 0 >> >> missing, 1 inconsistent objects >> >> 2014-12-10 10:56:50.597957 7f8f618be700 0 log [ERR] : 9.180 deep-scrub >> >> 1 >> >> errors >> >> >> >> I'm running version firefly 0.80.7. >> > >> > >> > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] unable to repair PG
Hi Greg, thanks for your help. It's always highly appreciated. :) On Thu, Dec 11, 2014 at 6:41 PM, Gregory Farnum wrote: > On Thu, Dec 11, 2014 at 2:57 AM, Luis Periquito > wrote: > > Hi, > > > > I've stopped OSD.16, removed the PG from the local filesystem and started > > the OSD again. After ceph rebuilt the PG in the removed OSD I ran a > > deep-scrub and the PG is still inconsistent. > > What led you to remove it from osd 16? Is that the one hosting the log > you snipped from? Is osd 16 the one hosting shard 6 of that PG, or was > it the primary? > OSD 16 is both the primary for this PG and the one that has the snipped log. The other 3 OSDs don't have any mention of this PG in their logs. Just some messages about slow requests and the backfill when I removed the object. Actually it came from OSD.6 - currently we don't have OSD.3. this is the output of the pg dump for this PG 9.180 25614 0 0 0 23306482348 3001 3001 active+clean+inconsistent 2014-12-10 17:29:01.937929 40242'1108124 40242:23305321 [16,10,27,6] 16 [16,10,27,6] 16 40242'1071363 2014-12-10 17:29:01.937881 40242'1071363 2014-12-10 17:29:01.937881 > Anyway, the message means that shard 6 (which I think is the seventh > OSD in the list) of PG 9.180 is missing a bunch of xattrs on object > 370cbf80/29145.4_xxx/head//9. I'm actually a little surprised it > didn't crash if it's missing the "_" attr > -Greg > Any idea on how to fix it? > > > > > I'm running out of ideas on trying to solve this. Does this mean that all > > copies of the object should also be inconsistent? Should I just try to > > figure which object/bucket this belongs to and delete it/copy it again to > > the ceph cluster? > > > > Also, do you know what the error message means? is it just some sort of > > metadata for this object that isn't correct, not the object itself? 
> > > > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito > > wrote: > >> > >> Hi, > >> > >> In the last few days this PG (pool is .rgw.buckets) has been in error > >> after running the scrub process. > >> > >> After getting the error, and trying to see what may be the issue (and > >> finding none), I've just issued a ceph repair followed by a ceph > deep-scrub. > >> However it doesn't seem to have fixed the issue and it still remains. > >> > >> The relevant log from the OSD is as follows. > >> > >> 2014-12-10 09:38:09.348110 7f8f618be700 0 log [ERR] : 9.180 deep-scrub > 0 > >> missing, 1 inconsistent objects > >> 2014-12-10 09:38:09.348116 7f8f618be700 0 log [ERR] : 9.180 deep-scrub > 1 > >> errors > >> 2014-12-10 10:13:15.922065 7f8f618be700 0 log [INF] : 9.180 repair ok, > 0 > >> fixed > >> 2014-12-10 10:55:27.556358 7f8f618be700 0 log [ERR] : 9.180 shard 6: > soid > >> 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr _user.rgw.acl, > >> missing attr _user.rgw.content_type, missing attr _user.rgw.etag, > missing > >> attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr > >> _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, > missing > >> attr snapset > >> 2014-12-10 10:56:50.597952 7f8f618be700 0 log [ERR] : 9.180 deep-scrub > 0 > >> missing, 1 inconsistent objects > >> 2014-12-10 10:56:50.597957 7f8f618be700 0 log [ERR] : 9.180 deep-scrub > 1 > >> errors > >> > >> I'm running version firefly 0.80.7.
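The unreadable pg dump rows quoted in this thread are column-aligned `ceph pg dump` output that lost its tab separators. Assuming the firefly-era column order (pg, objects, mip, degr, unf, bytes, log, disklog, state, state stamp, version, reported, up, up_primary, acting, acting_primary, scrub/deep-scrub stamps), a throwaway parser can recover the fields that matter in this discussion, the state and the up/acting sets. The column layout is an assumption based on that era's `pg dump` header, so treat this as a sketch:

```python
def parse_pg_row(row):
    """Split one whitespace-separated `ceph pg dump` row into named fields.

    Assumed firefly-era column order: pg objects mip degr unf bytes log
    disklog state state_stamp(date time) v reported up up_primary acting
    acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp.
    """
    f = row.split()
    osds = lambda s: [int(x) for x in s.strip("[]").split(",")]
    return {
        "pg": f[0],
        "objects": int(f[1]),
        "bytes": int(f[5]),
        "state": f[8],
        "up": osds(f[13]),
        "up_primary": int(f[14]),
        "acting": osds(f[15]),
        "acting_primary": int(f[16]),
    }

# The row for pg 9.180 from this thread, re-spaced into its columns.
row = ("9.180 25614 0 0 0 23306482348 3001 3001 active+clean+inconsistent "
       "2014-12-10 17:29:01.937929 40242'1108124 40242:23305321 "
       "[16,10,27,6] 16 [16,10,27,6] 16 40242'1071363 "
       "2014-12-10 17:29:01.937881 40242'1071363 2014-12-10 17:29:01.937881")
info = parse_pg_row(row)
print(info["state"], info["acting"], info["acting_primary"])
```

This makes it immediately visible that the acting set is [16, 10, 27, 6] with osd.16 as primary, which is where the shard-6/osd.6 confusion in the thread comes from.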
Re: [ceph-users] unable to repair PG
Be very careful with running "ceph pg repair". Have a look at this thread: http://thread.gmane.org/gmane.comp.file-systems.ceph.user/15185 -- Tomasz Kuzemko tomasz.kuze...@ovh.net On Thu, Dec 11, 2014 at 10:57:22AM +, Luis Periquito wrote: > Hi, > > I've stopped OSD.16, removed the PG from the local filesystem and started > the OSD again. After ceph rebuilt the PG in the removed OSD I ran a > deep-scrub and the PG is still inconsistent. > > I'm running out of ideas on trying to solve this. Does this mean that all > copies of the object should also be inconsistent? Should I just try to > figure which object/bucket this belongs to and delete it/copy it again to > the ceph cluster? > > Also, do you know what the error message means? is it just some sort of > metadata for this object that isn't correct, not the object itself? > > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito > wrote: > > > Hi, > > > > In the last few days this PG (pool is .rgw.buckets) has been in error > > after running the scrub process. > > > > After getting the error, and trying to see what may be the issue (and > > finding none), I've just issued a ceph repair followed by a ceph > > deep-scrub. However it doesn't seem to have fixed the issue and it still > > remains. > > > > The relevant log from the OSD is as follows. 
> > > > 2014-12-10 09:38:09.348110 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 > > missing, 1 inconsistent objects > > 2014-12-10 09:38:09.348116 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 > > errors > > 2014-12-10 10:13:15.922065 7f8f618be700 0 log [INF] : 9.180 repair ok, 0 > > fixed > > 2014-12-10 10:55:27.556358 7f8f618be700 0 log [ERR] : 9.180 shard 6: soid > > 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr _user.rgw.acl, > > missing attr _user.rgw.content_type, missing attr _user.rgw.etag, missing > > attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr > > _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, > > missing attr snapset > > 2014-12-10 10:56:50.597952 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 > > missing, 1 inconsistent objects > > 2014-12-10 10:56:50.597957 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 > > errors > > > > I'm running version firefly 0.80.7.
Re: [ceph-users] unable to repair PG
On Thu, Dec 11, 2014 at 2:57 AM, Luis Periquito wrote: > Hi, > > I've stopped OSD.16, removed the PG from the local filesystem and started > the OSD again. After ceph rebuilt the PG in the removed OSD I ran a > deep-scrub and the PG is still inconsistent. What led you to remove it from osd 16? Is that the one hosting the log you snipped from? Is osd 16 the one hosting shard 6 of that PG, or was it the primary? Anyway, the message means that shard 6 (which I think is the seventh OSD in the list) of PG 9.180 is missing a bunch of xattrs on object 370cbf80/29145.4_xxx/head//9. I'm actually a little surprised it didn't crash if it's missing the "_" attr -Greg > > I'm running out of ideas on trying to solve this. Does this mean that all > copies of the object should also be inconsistent? Should I just try to > figure which object/bucket this belongs to and delete it/copy it again to > the ceph cluster? > > Also, do you know what the error message means? is it just some sort of > metadata for this object that isn't correct, not the object itself? > > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito > wrote: >> >> Hi, >> >> In the last few days this PG (pool is .rgw.buckets) has been in error >> after running the scrub process. >> >> After getting the error, and trying to see what may be the issue (and >> finding none), I've just issued a ceph repair followed by a ceph deep-scrub. >> However it doesn't seem to have fixed the issue and it still remains. >> >> The relevant log from the OSD is as follows. 
>> >> 2014-12-10 09:38:09.348110 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 >> missing, 1 inconsistent objects >> 2014-12-10 09:38:09.348116 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 >> errors >> 2014-12-10 10:13:15.922065 7f8f618be700 0 log [INF] : 9.180 repair ok, 0 >> fixed >> 2014-12-10 10:55:27.556358 7f8f618be700 0 log [ERR] : 9.180 shard 6: soid >> 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr _user.rgw.acl, >> missing attr _user.rgw.content_type, missing attr _user.rgw.etag, missing >> attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr >> _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, missing >> attr snapset >> 2014-12-10 10:56:50.597952 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 >> missing, 1 inconsistent objects >> 2014-12-10 10:56:50.597957 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 >> errors >> >> I'm running version firefly 0.80.7.
Re: [ceph-users] unable to repair PG
Hi, I've stopped OSD.16, removed the PG from the local filesystem and started the OSD again. After ceph rebuilt the PG in the removed OSD I ran a deep-scrub and the PG is still inconsistent. I'm running out of ideas on trying to solve this. Does this mean that all copies of the object should also be inconsistent? Should I just try to figure which object/bucket this belongs to and delete it/copy it again to the ceph cluster? Also, do you know what the error message means? is it just some sort of metadata for this object that isn't correct, not the object itself? On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito wrote: > Hi, > > In the last few days this PG (pool is .rgw.buckets) has been in error > after running the scrub process. > > After getting the error, and trying to see what may be the issue (and > finding none), I've just issued a ceph repair followed by a ceph > deep-scrub. However it doesn't seem to have fixed the issue and it still > remains. > > The relevant log from the OSD is as follows. > > 2014-12-10 09:38:09.348110 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 > missing, 1 inconsistent objects > 2014-12-10 09:38:09.348116 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 > errors > 2014-12-10 10:13:15.922065 7f8f618be700 0 log [INF] : 9.180 repair ok, 0 > fixed > 2014-12-10 10:55:27.556358 7f8f618be700 0 log [ERR] : 9.180 shard 6: soid > 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr _user.rgw.acl, > missing attr _user.rgw.content_type, missing attr _user.rgw.etag, missing > attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr > _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, > missing attr snapset > 2014-12-10 10:56:50.597952 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 > missing, 1 inconsistent objects > 2014-12-10 10:56:50.597957 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 > errors > > I'm running version firefly 0.80.7. 
[ceph-users] unable to repair PG
Hi, In the last few days this PG (pool is .rgw.buckets) has been in error after running the scrub process. After getting the error, and trying to see what may be the issue (and finding none), I've just issued a ceph repair followed by a ceph deep-scrub. However it doesn't seem to have fixed the issue and it still remains. The relevant log from the OSD is as follows. 2014-12-10 09:38:09.348110 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects 2014-12-10 09:38:09.348116 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 errors 2014-12-10 10:13:15.922065 7f8f618be700 0 log [INF] : 9.180 repair ok, 0 fixed 2014-12-10 10:55:27.556358 7f8f618be700 0 log [ERR] : 9.180 shard 6: soid 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr _user.rgw.acl, missing attr _user.rgw.content_type, missing attr _user.rgw.etag, missing attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, missing attr snapset 2014-12-10 10:56:50.597952 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects 2014-12-10 10:56:50.597957 7f8f618be700 0 log [ERR] : 9.180 deep-scrub 1 errors I'm running version firefly 0.80.7.
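For what it's worth, the `[ERR]` line format above is regular enough to pick apart mechanically when a scrub reports many missing xattrs. The sketch below is an illustrative parser for this firefly-era message format, not an official tool; the sample line is abbreviated from the log above:

```python
import re

def parse_scrub_error(line):
    """Parse a firefly-style scrub [ERR] line about missing attrs.

    Returns (pg, shard, soid, missing_attrs) or None if the line does not
    match. The format is assumed from the log excerpt in this thread.
    """
    m = re.search(
        r"\[ERR\] : (?P<pg>\S+) shard (?P<shard>\d+): soid (?P<soid>\S+) (?P<rest>.*)",
        line,
    )
    if not m:
        return None
    # Each missing xattr is reported as "missing attr <name>", comma-separated.
    missing = re.findall(r"missing attr (\S+?)(?:,|$)", m.group("rest"))
    return m.group("pg"), int(m.group("shard")), m.group("soid"), missing

line = ("2014-12-10 10:55:27.556358 7f8f618be700 0 log [ERR] : 9.180 shard 6: "
        "soid 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr "
        "_user.rgw.acl, missing attr snapset")
pg, shard, soid, missing = parse_scrub_error(line)
print(pg, shard, soid, missing)
```

A list of the missing attr names makes it easier to see at a glance that the shard lost *all* of its xattrs (including `_` and `snapset`), i.e. the object's metadata is gone on that one OSD, not corrupted everywhere.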