We hit this issue as well last night. If it weren’t for a backup of the file that vanished, it would have been lost.
> On 13 Nov 2020, at 18:14, Eric Ivancich <ivanc...@redhat.com> wrote:
>
> I have some questions for those who’ve experienced this issue.
>
> 1. It seems like those reporting this issue are seeing it strictly after
> upgrading to Octopus. From what version did each of these sites upgrade to
> Octopus? From Nautilus? Mimic? Luminous?

We were running Mimic, then upgraded to Nautilus and then Octopus in short succession.

> 2. Does anyone have any lifecycle rules on a bucket experiencing this issue?
> If so, please describe.

The bucket where we hit this yesterday has neither lifecycle rules nor object versioning.

> 3. Is anyone making copies of the affected objects (to same or to a different
> bucket) prior to seeing the issue? And if they are making copies, does the
> destination bucket have lifecycle rules? And if they are making copies, are
> those copies ever being removed?

The object that went missing, a Docker layer, has not been touched in 2 months. It worked for a while, but then suddenly went missing.

> Thanks,
>
> Eric
>
>> On Nov 12, 2020, at 4:54 PM, huxia...@horebdata.cn wrote:
>>
>> Looks like this is a very dangerous bug for data safety. Hope the bug will
>> be quickly identified and fixed.
>>
>> best regards,
>>
>> Samuel
>>
>> huxia...@horebdata.cn
>>
>> From: Janek Bevendorff
>> Date: 2020-11-12 18:17
>> To: huxia...@horebdata.cn; EDH - Manuel Rios; Rafael Lopez
>> CC: Robin H. Johnson; ceph-users
>> Subject: Re: [ceph-users] Re: NoSuchKey on key that is visible in s3
>> list/radosgw bk
>>
>> I have never seen this on Luminous. I recently upgraded to Octopus and the
>> issue started occurring only a few weeks later.
>>
>> On 12/11/2020 16:37, huxia...@horebdata.cn wrote:
>> Which Ceph versions are affected by this RGW bug? Luminous, Mimic,
>> Octopus, or the latest?
>>
>> Any idea?
>>
>> samuel
>>
>> huxia...@horebdata.cn
>>
>> From: EDH - Manuel Rios
>> Date: 2020-11-12 14:27
>> To: Janek Bevendorff; Rafael Lopez
>> CC: Robin H. Johnson; ceph-users
>> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3
>> list/radosgw bk
>>
>> This same error caused us to wipe a full 300 TB cluster. It is likely
>> related to some RADOS index/database bug, not to S3.
>>
>> As Janek explained, this is a major issue, because the error happens
>> silently and you can only detect it through S3, when you go to delete or
>> purge an S3 bucket and it returns NoSuchKey. The error is not related to
>> S3 logic.
>>
>> Hope this time the devs can take enough time to find and resolve the issue.
>> The error happens with low EC profiles, and even with 3x replication in
>> some cases.
>>
>> Regards
>>
>> -----Original Message-----
>> From: Janek Bevendorff <janek.bevendo...@uni-weimar.de>
>> Sent: Thursday, 12 November 2020 14:06
>> To: Rafael Lopez <rafael.lo...@monash.edu>
>> CC: Robin H. Johnson <robb...@gentoo.org>; ceph-users <ceph-users@ceph.io>
>> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3
>> list/radosgw bk
>>
>> Here is a bug report concerning (probably) this exact issue:
>> https://tracker.ceph.com/issues/47866
>>
>> I left a comment describing the situation and my (limited) experiences with
>> it.
>>
>> On 11/11/2020 10:04, Janek Bevendorff wrote:
>>>
>>> Yeah, that seems to be it. There are 239 objects prefixed
>>> .8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none
>>> of the multiparts from the other file to be found and the head object
>>> is 0 bytes.
>>>
>>> I checked another multipart object with an end pointer of 11.
>>> Surprisingly, it had way more than 11 parts (39 to be precise) named
>>> .1, .1_1, .1_2, .1_3, etc. Not sure how Ceph identifies those, but I
>>> could find them in the dump at least.
>>>
>>> I have no idea why the objects disappeared. I ran a Spark job over all
>>> buckets, read 1 byte of every object and recorded errors. Of the 78
>>> buckets, two are missing objects. One bucket is missing one object,
>>> the other 15. So, luckily, the incidence is still quite low, but the
>>> problem seems to be expanding slowly.
>>>
>>> On 10/11/2020 23:46, Rafael Lopez wrote:
>>>> Hi Janek,
>>>>
>>>> What you said sounds right - an S3 single part obj won't have an S3
>>>> multipart string as part of the prefix. S3 multipart string looks
>>>> like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".
>>>>
>>>> From memory, single part S3 objects that don't fit in a single rados
>>>> object are assigned a random prefix that has nothing to do with
>>>> the object name, and the rados tail/data objects (not the head
>>>> object) have that prefix.
>>>> As per your working example, the prefix for that would be
>>>> '.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow"
>>>> objects with names containing that prefix, and if you add up the
>>>> sizes it should be the size of your S3 object.
>>>>
>>>> You should look at working and non working examples of both single
>>>> and multipart S3 objects, as they are probably all a bit different
>>>> when you look in rados.
>>>>
>>>> I agree it is a serious issue, because once objects are no longer in
>>>> rados, they cannot be recovered. If it was a case that there was a
>>>> link broken or rados objects renamed, then we could work to
>>>> recover...but as far as I can tell, it looks like stuff is just
>>>> vanishing from rados. The only explanation I can think of is some
>>>> (rgw or rados) background process is incorrectly doing something with
>>>> these objects (eg. renaming/deleting). I had thought perhaps it was a
>>>> bug with the rgw garbage collector..but that is pure speculation.
>>>>
>>>> Once you can articulate the problem, I'd recommend logging a bug
>>>> tracker upstream.
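
Rafael's suggestion above (look for the tail objects that carry the random prefix) can be sketched as a simple filter over a saved pool listing. This is only an illustration: the function name and the sample listing are made up, the `bucketmarker__shadow_` naming pattern in the sample is an assumption, and the listing is assumed to be one object name per line, as produced by something like `rados -p <pool> ls`.

```python
from typing import Iterable, List


def tail_objects_for_prefix(dump_lines: Iterable[str], prefix: str) -> List[str]:
    """Return the RADOS object names in a pool listing that contain `prefix`."""
    return [line.strip() for line in dump_lines if prefix in line]


# Hypothetical excerpt of a pool listing; the naming pattern is assumed
# for illustration and is not taken from the thread.
sample_dump = [
    "bucketmarker__shadow_.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_1\n",
    "bucketmarker__shadow_.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_2\n",
    "bucketmarker__shadow_.otherPrefixXYZ_1\n",
    "bucketmarker_29/items/example.warc.gz\n",
]

matches = tail_objects_for_prefix(sample_dump, ".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh")
print(len(matches))  # number of tail objects found for this prefix
```

For an intact object the count should match the number of tail objects the manifest points at (239 in Janek's working example); a count of zero for a listed object would be consistent with the tail objects having vanished from RADOS.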
>>>>
>>>> On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff
>>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>>
>>>>     Here's something else I noticed: when I stat objects that work
>>>>     via radosgw-admin, the stat info contains a "begin_iter" JSON
>>>>     object with RADOS key info like this
>>>>
>>>>     "key": {
>>>>         "name": "29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.gz",
>>>>         "instance": "",
>>>>         "ns": ""
>>>>     }
>>>>
>>>>     and then "end_iter" with key info like this:
>>>>
>>>>     "key": {
>>>>         "name": ".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239",
>>>>         "instance": "",
>>>>         "ns": "shadow"
>>>>     }
>>>>
>>>>     However, when I check the broken 0-byte object, the "begin_iter"
>>>>     and "end_iter" keys look like this:
>>>>
>>>>     "key": {
>>>>         "name": "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.1",
>>>>         "instance": "",
>>>>         "ns": "multipart"
>>>>     }
>>>>
>>>>     [...]
>>>>
>>>>     "key": {
>>>>         "name": "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19",
>>>>         "instance": "",
>>>>         "ns": "multipart"
>>>>     }
>>>>
>>>>     So, it's the full name plus a suffix and the namespace is
>>>>     multipart, not shadow (or empty). This in itself may just be an
>>>>     artefact of whether the object was uploaded in one go or as a
>>>>     multipart object, but the second difference is that I cannot find
>>>>     any of the multipart objects in my pool's object name dump. I
>>>>     can, however, find the shadow RADOS object of the intact S3 object.
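
The comparison Janek describes (checking the namespace of the "end_iter" key in the stat output) could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and because the exact nesting around "begin_iter"/"end_iter" is not shown in the thread, the code searches for the fields by name instead of relying on a fixed path. The nesting in the sample document is likewise assumed; only the "key" object itself is copied from the message above.

```python
import json
from typing import Any, Optional


def find_field(node: Any, name: str) -> Any:
    """Depth-first search for the first value stored under `name`."""
    if isinstance(node, dict):
        if name in node:
            return node[name]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_field(child, name)
        if found is not None:
            return found
    return None


def tail_namespace(stat: Any) -> Optional[str]:
    """Return the "ns" of the key found under "end_iter", or None if absent."""
    key = find_field(find_field(stat, "end_iter"), "key")
    return key.get("ns") if isinstance(key, dict) else None


# Assumed wrapper structure; only the inner "key" object comes from the thread.
sample = json.loads("""
{"manifest": {"end_iter": {"location": {"obj": {"key": {
    "name": ".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239",
    "instance": "",
    "ns": "shadow"}}}}}}
""")

print(tail_namespace(sample))  # shadow
```

Going by the examples in this thread, "shadow" corresponds to an object uploaded in one go and "multipart" to a multipart upload; the namespace alone does not say whether the tail objects still exist, so it would need to be combined with a lookup in the pool listing.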
>>>> >>>> >>>> >>>> >>>> -- >>>> *Rafael Lopez* >>>> Devops Systems Engineer >>>> Monash University eResearch Centre >>>> >>>> T: +61 3 9905 9118 <tel:%2B61%203%209905%209118> >>>> E: rafael.lo...@monash.edu <mailto:rafael.lo...@monash.edu> >>>> >> _______________________________________________ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > > _______________________________________________ > ceph-users mailing list -- ceph-users@ceph.io <mailto:ceph-users@ceph.io> > To unsubscribe send an email to ceph-users-le...@ceph.io > <mailto:ceph-users-le...@ceph.io> _______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io