Thanks for the response Yehuda.
Status:
[root@objproxy02 UMobjstore]# radosgw-admin reshard status --bucket=$bucket_name
[
{
"reshard_status": 1,
"new_bucket_instance_id":
"8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.47370206.1",
"num_shards": 4
}
]
I cleared the flag using the bucket check --fix command and will keep an eye on
that tracker issue.
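For reference, the sequence I ran against each affected bucket was along these lines ($bucket_name standing in for the real bucket name):

    # cancel any pending reshard entries for the bucket (no-op if none remain)
    radosgw-admin reshard cancel --bucket=$bucket_name
    # clear the stale resharding flag on the bucket index header
    radosgw-admin bucket check --fix --bucket=$bucket_name
    # confirm reshard_status has returned to 0 (not resharding)
    radosgw-admin reshard status --bucket=$bucket_name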
Do you have any insight into why the RGWs ultimately paused/reloaded and failed
to come back? I'm happy to provide more information if it would help. At the
moment we are hesitant to re-enable dynamic sharding, since it seems to have
contributed to this problem.
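In case it's useful, dynamic sharding is currently disabled on our RGW hosts with something like the following in ceph.conf (option name per the Luminous docs; the section name would need to match the deployment's RGW client sections):

    [client.rgw]
    rgw dynamic resharding = false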
Thanks,
Ryan
> On Oct 9, 2017, at 5:26 PM, Yehuda Sadeh-Weinraub <[email protected]> wrote:
>
> On Mon, Oct 9, 2017 at 1:59 PM, Ryan Leimenstoll
> <[email protected]> wrote:
>> Hi all,
>>
>> We recently upgraded to Ceph 12.2.1 (Luminous) from 12.2.0 however are now
>> seeing issues running radosgw. Specifically, it appears an automatically
>> triggered resharding operation won’t end, despite the jobs being cancelled
>> (radosgw-admin reshard cancel). I have also disabled dynamic sharding for
>> the time being in the ceph.conf.
>>
>>
>> [root@objproxy02 ~]# radosgw-admin reshard list
>> []
>>
>> The two buckets were also reported in the `radosgw-admin reshard list`
>> before our RGW frontends paused recently (and only came back after a service
>> restart). These two buckets cannot currently be written to at this point
>> either.
>>
>> 2017-10-06 22:41:19.547260 7f90506e9700 0 block_while_resharding ERROR:
>> bucket is still resharding, please retry
>> 2017-10-06 22:41:19.547411 7f90506e9700 0 WARNING: set_req_state_err
>> err_no=2300 resorting to 500
>> 2017-10-06 22:41:19.547729 7f90506e9700 0 ERROR:
>> RESTFUL_IO(s)->complete_header() returned err=Input/output error
>> 2017-10-06 22:41:19.548570 7f90506e9700 1 ====== req done req=0x7f90506e3180
>> op status=-2300 http_status=500 ======
>> 2017-10-06 22:41:19.548790 7f90506e9700 1 civetweb: 0x55766d111000:
>> $MY_IP_HERE$ - - [06/Oct/2017:22:33:47 -0400] "PUT /
>> $REDACTED_BUCKET_NAME$/$REDACTED_KEY_NAME$ HTTP/1.1" 1 0 - Boto3/1.4.7
>> Python/2.7.12 Linux/4.9.43-17.39.amzn1.x86_64 exec-env/AWS_Lambda_python2.7
>> Botocore/1.7.2 Resource
>> [.. slightly later in the logs..]
>> 2017-10-06 22:41:53.516272 7f90406c9700 1 rgw realm reloader: Frontends
>> paused
>> 2017-10-06 22:41:53.528703 7f907893f700 0 ERROR: failed to clone shard,
>> completion_mgr.get_next() returned ret=-125
>> 2017-10-06 22:44:32.049564 7f9074136700 0 ERROR: keystone revocation
>> processing returned error r=-22
>> 2017-10-06 22:59:32.059222 7f9074136700 0 ERROR: keystone revocation
>> processing returned error r=-22
>>
>> Can anyone advise on the best path forward to stop the current sharding
>> states and avoid this moving forward?
>>
>
> What does 'radosgw-admin reshard status --bucket=<bucket>' return?
> I think just manually resharding the buckets should clear this flag,
> is that not an option?
> manual reshard: radosgw-admin bucket reshard --bucket=<bucket>
> --num-shards=<num>
>
> also, the 'radosgw-admin bucket check --fix' might clear that flag.
>
> For some reason it seems that the reshard cancellation code is not
> clearing that flag on the bucket index header (pretty sure it used to
> do it at one point). I'll open a tracker ticket.
>
> Thanks,
> Yehuda
>
>>
>> Some other details:
>> - 3 rgw instances
>> - Ceph Luminous 12.2.1
>> - 584 active OSDs, rgw bucket index is on Intel NVMe OSDs
>>
>>
>> Thanks,
>> Ryan Leimenstoll
>> [email protected]
>> University of Maryland Institute for Advanced Computer Studies
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com