Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries

2017-06-22 Thread Pavan Rallabhandi
Looks like I’ve now got a consistent repro scenario; please find the gory
details here: http://tracker.ceph.com/issues/20380

Thanks!

On 20/06/17, 2:04 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote:

Hi Orit,

No, we do not use multi-site.

Thanks,
-Pavan.

From: Orit Wasserman <owass...@redhat.com>
Date: Tuesday, 20 June 2017 at 12:49 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries

Hi Pavan, 

On Tue, Jun 20, 2017 at 8:29 AM, Pavan Rallabhandi 
<prallabha...@walmartlabs.com> wrote:
Trying one more time with ceph-users

On 19/06/17, 11:07 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> 
wrote:

On many of our clusters running Jewel (10.2.5+), I am running into a
strange problem: stale bucket index entries are left over for some of the
deleted objects. Though it is not reproducible at will, it has been pretty
consistent of late, and I am clueless at this point about the possible
reasons for it.

The symptoms are that the actual delete operation of an object is
reported successful in the RGW logs, but a bucket list on the container
still shows the deleted object. An attempt to download/stat the object
appropriately results in a failure. No failures are seen in the respective OSDs
where the bucket index object is located. Rebuilding the bucket index by
running ‘radosgw-admin bucket check --fix’ fixes the issue.
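
The symptom described above (an entry that is still listed but can no longer
be stat'ed) can be checked for mechanically. The following is only a minimal
sketch: `find_stale_entries` and its `object_exists` callback are hypothetical
names, and the sample data stands in for a real container listing and real
HEAD/stat requests against the gateway.

```python
from typing import Callable, Iterable, List


def find_stale_entries(
    listed: Iterable[str],
    object_exists: Callable[[str], bool],
) -> List[str]:
    """Return names that the listing reports but that cannot be stat'ed."""
    return [name for name in listed if not object_exists(name)]


if __name__ == "__main__":
    # Hypothetical data: in practice `listing` would come from a container
    # listing and the callback would issue a HEAD/stat request per object.
    listing = ["a.txt", "b.txt", "c.txt"]
    live_objects = {"a.txt", "c.txt"}  # b.txt was deleted, its entry leaked
    print(find_stale_entries(listing, lambda name: name in live_objects))
```

Any name returned by such a check is a candidate stale index entry of the
kind reported here.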

I could simulate the problem by instrumenting the code so that
`complete_del` is not invoked on the bucket index op
(https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793), but that
call always seems to be made unless there is a cascading error from the
actual delete operation of the object, which doesn’t seem to be the case here.
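
To illustrate why skipping the index completion step produces exactly these
symptoms, here is a toy model. It is not Ceph code: `ToyBucket`, its methods,
and the `run_complete_del` flag are invented for illustration, and
`check_fix` only loosely mimics the effect attributed to
‘radosgw-admin bucket check --fix’ above.

```python
class ToyBucket:
    """Toy model: an index (what listings show) plus the object data."""

    def __init__(self):
        self.index = set()
        self.data = {}

    def put(self, name, payload):
        self.data[name] = payload
        self.index.add(name)

    def delete(self, name, run_complete_del=True):
        # Step 1: remove the object itself; this is what gets reported
        # as a successful delete.
        del self.data[name]
        # Step 2: complete the index operation, dropping the listing entry.
        # Skipping it models the instrumented build described above.
        if run_complete_del:
            self.index.discard(name)

    def list(self):
        return sorted(self.index)

    def stat(self, name):
        return name in self.data

    def check_fix(self):
        # Rebuild the index from the objects that actually exist.
        self.index = set(self.data)


if __name__ == "__main__":
    bucket = ToyBucket()
    bucket.put("obj", b"payload")
    bucket.delete("obj", run_complete_del=False)
    print(bucket.list(), bucket.stat("obj"))  # entry listed, stat fails
    bucket.check_fix()
    print(bucket.list())  # stale entry gone after the rebuild
```

In this model the delete succeeds, the listing still shows the object, and a
stat fails, which matches the reported behaviour; what it cannot show is why
the completion step would be skipped or lost in a real cluster.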

I wanted to know the possible reasons why the bucket index would be
left in such limbo; any pointers would be much appreciated. FWIW, we are not
sharding the buckets, and very recently I’ve seen this happen with buckets
having fewer than 10 objects. We are using Swift for all the operations.

Do you use multisite? 

Regards,
Orit
 
Thanks,
-Pavan.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries

2017-06-20 Thread Pavan Rallabhandi
Hi Orit,

No, we do not use multi-site.

Thanks,
-Pavan.

From: Orit Wasserman <owass...@redhat.com>
Date: Tuesday, 20 June 2017 at 12:49 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries

Hi Pavan, 

On Tue, Jun 20, 2017 at 8:29 AM, Pavan Rallabhandi 
<prallabha...@walmartlabs.com> wrote:
Trying one more time with ceph-users

On 19/06/17, 11:07 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote:

    On many of our clusters running Jewel (10.2.5+), I am running into a
strange problem: stale bucket index entries are left over for some of the
deleted objects. Though it is not reproducible at will, it has been pretty
consistent of late, and I am clueless at this point about the possible
reasons for it.

    The symptoms are that the actual delete operation of an object is reported
successful in the RGW logs, but a bucket list on the container still shows
the deleted object. An attempt to download/stat the object appropriately
results in a failure. No failures are seen in the respective OSDs where the
bucket index object is located. Rebuilding the bucket index by running
‘radosgw-admin bucket check --fix’ fixes the issue.

    I could simulate the problem by instrumenting the code so that
`complete_del` is not invoked on the bucket index op
(https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793), but that
call always seems to be made unless there is a cascading error from the
actual delete operation of the object, which doesn’t seem to be the case here.

    I wanted to know the possible reasons why the bucket index would be left
in such limbo; any pointers would be much appreciated. FWIW, we are not
sharding the buckets, and very recently I’ve seen this happen with buckets
having fewer than 10 objects. We are using Swift for all the operations.

Do you use multisite? 

Regards,
Orit
 
    Thanks,
    -Pavan.







Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries

2017-06-20 Thread Orit Wasserman
Hi Pavan,

On Tue, Jun 20, 2017 at 8:29 AM, Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> Trying one more time with ceph-users
>
> On 19/06/17, 11:07 PM, "Pavan Rallabhandi" 
> wrote:
>
> On many of our clusters running Jewel (10.2.5+), I am running into a
> strange problem: stale bucket index entries are left over for some of the
> deleted objects. Though it is not reproducible at will, it has been
> pretty consistent of late, and I am clueless at this point about the
> possible reasons for it.
>
> The symptoms are that the actual delete operation of an object is
> reported successful in the RGW logs, but a bucket list on the container
> still shows the deleted object. An attempt to download/stat the
> object appropriately results in a failure. No failures are seen in the
> respective OSDs where the bucket index object is located. Rebuilding
> the bucket index by running ‘radosgw-admin bucket check --fix’ fixes the
> issue.
>
> I could simulate the problem by instrumenting the code so that
> `complete_del` is not invoked on the bucket index op
> (https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793), but
> that call always seems to be made unless there is a cascading error from
> the actual delete operation of the object, which doesn’t seem to be the
> case here.
>
> I wanted to know the possible reasons why the bucket index would be
> left in such limbo; any pointers would be much appreciated. FWIW, we are
> not sharding the buckets, and very recently I’ve seen this happen with
> buckets having fewer than 10 objects. We are using Swift for all the
> operations.
>
>
Do you use multisite?

Regards,
Orit


> Thanks,
> -Pavan.
>
>
>
>