On Mon, Oct 9, 2017 at 1:59 PM, Ryan Leimenstoll
<rleim...@umiacs.umd.edu> wrote:
> Hi all,
>
> We recently upgraded from Ceph 12.2.0 to 12.2.1 (Luminous), but we are now 
> seeing issues running radosgw. Specifically, an automatically triggered 
> resharding operation won't finish, even though the jobs have been cancelled 
> (radosgw-admin reshard cancel). I have also disabled dynamic resharding in 
> ceph.conf for the time being.
>
>
> [root@objproxy02 ~]# radosgw-admin reshard list
> []
>
> The two buckets in question were also reported in the `radosgw-admin reshard 
> list` output before our RGW frontends recently paused (they only came back 
> after a service restart). Neither of these buckets can currently be written 
> to.
>
> 2017-10-06 22:41:19.547260 7f90506e9700 0 block_while_resharding ERROR: 
> bucket is still resharding, please retry
> 2017-10-06 22:41:19.547411 7f90506e9700 0 WARNING: set_req_state_err 
> err_no=2300 resorting to 500
> 2017-10-06 22:41:19.547729 7f90506e9700 0 ERROR: 
> RESTFUL_IO(s)->complete_header() returned err=Input/output error
> 2017-10-06 22:41:19.548570 7f90506e9700 1 ====== req done req=0x7f90506e3180 
> op status=-2300 http_status=500 ======
> 2017-10-06 22:41:19.548790 7f90506e9700 1 civetweb: 0x55766d111000: 
> $MY_IP_HERE$ - - [06/Oct/2017:22:33:47 -0400] "PUT 
> /$REDACTED_BUCKET_NAME$/$REDACTED_KEY_NAME$ HTTP/1.1" 1 0 - Boto3/1.4.7 
> Python/2.7.12 Linux/4.9.43-17.39.amzn1.x86_64 exec-env/AWS_Lambda_python2.7 
> Botocore/1.7.2 Resource
> [.. slightly later in the logs..]
> 2017-10-06 22:41:53.516272 7f90406c9700 1 rgw realm reloader: Frontends paused
> 2017-10-06 22:41:53.528703 7f907893f700 0 ERROR: failed to clone shard, 
> completion_mgr.get_next() returned ret=-125
> 2017-10-06 22:44:32.049564 7f9074136700 0 ERROR: keystone revocation 
> processing returned error r=-22
> 2017-10-06 22:59:32.059222 7f9074136700 0 ERROR: keystone revocation 
> processing returned error r=-22
>
> Can anyone advise on the best way to stop the current resharding state and 
> avoid this issue in the future?
>

What does 'radosgw-admin reshard status --bucket=<bucket>' return?
I think just manually resharding the buckets should clear this flag;
is that not an option? To reshard manually:

  radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<num>
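
For example, with a hypothetical bucket name (pick --num-shards based on
the bucket's object count; IIRC dynamic resharding targets roughly 100k
objects per shard by default):

  radosgw-admin reshard status --bucket=mybucket
  radosgw-admin bucket reshard --bucket=mybucket --num-shards=64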

Also, running 'radosgw-admin bucket check --fix' on the bucket might clear that flag.
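
For example (again with a hypothetical bucket name):

  radosgw-admin bucket check --fix --bucket=mybucket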

For some reason the reshard cancellation code does not seem to be
clearing that flag on the bucket index header (I'm pretty sure it did
at one point). I'll open a tracker ticket.
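
For reference, the ceph.conf setting you mentioned for disabling dynamic
resharding would look like this (assuming the Luminous option name, which
IIRC is rgw_dynamic_resharding), followed by an RGW restart:

  [global]
  rgw dynamic resharding = false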

Thanks,
Yehuda

>
> Some other details:
>  - 3 rgw instances
>  - Ceph Luminous 12.2.1
>  - 584 active OSDs, rgw bucket index is on Intel NVMe OSDs
>
>
> Thanks,
> Ryan Leimenstoll
> rleim...@umiacs.umd.edu
> University of Maryland Institute for Advanced Computer Studies
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
