Thanks for the response, Yehuda. 

Status:
[root@objproxy02 UMobjstore]# radosgw-admin reshard status --bucket=$bucket_name
[
    {
        "reshard_status": 1,
        "new_bucket_instance_id": 
"8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.47370206.1",
        "num_shards": 4
    }
]

I cleared the flag using the `bucket check --fix` command and will keep an eye 
on that tracker issue. 
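
For reference, the invocation was along these lines (with the bucket name in 
$bucket_name, as in the status command above):

[root@objproxy02 UMobjstore]# radosgw-admin bucket check --fix --bucket=$bucket_name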

Do you have any insight into why the RGWs ultimately paused/reloaded and failed 
to come back? I am happy to provide any further information that would help. At 
the moment we are somewhat nervous about re-enabling dynamic sharding, since it 
seems to have contributed to this problem. 
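
For completeness, this is roughly what we have in ceph.conf on the RGW hosts 
(the section name below is illustrative; ours matches our actual instance 
names):

[client.rgw.objproxy02]
# disable dynamic bucket index resharding (default is true)
rgw dynamic resharding = false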

Thanks,
Ryan



> On Oct 9, 2017, at 5:26 PM, Yehuda Sadeh-Weinraub <[email protected]> wrote:
> 
> On Mon, Oct 9, 2017 at 1:59 PM, Ryan Leimenstoll
> <[email protected]> wrote:
>> Hi all,
>> 
>> We recently upgraded to Ceph 12.2.1 (Luminous) from 12.2.0; however, we are 
>> now seeing issues running radosgw. Specifically, it appears an automatically 
>> triggered resharding operation won’t end, despite the jobs being cancelled 
>> (radosgw-admin reshard cancel). I have also disabled dynamic sharding for 
>> the time being in ceph.conf.
>> 
>> 
>> [root@objproxy02 ~]# radosgw-admin reshard list
>> []
>> 
>> The two buckets were also reported by `radosgw-admin reshard list` before 
>> our RGW frontends paused recently (and only came back after a service 
>> restart). These two buckets currently cannot be written to either.
>> 
>> 2017-10-06 22:41:19.547260 7f90506e9700 0 block_while_resharding ERROR: 
>> bucket is still resharding, please retry
>> 2017-10-06 22:41:19.547411 7f90506e9700 0 WARNING: set_req_state_err 
>> err_no=2300 resorting to 500
>> 2017-10-06 22:41:19.547729 7f90506e9700 0 ERROR: 
>> RESTFUL_IO(s)->complete_header() returned err=Input/output error
>> 2017-10-06 22:41:19.548570 7f90506e9700 1 ====== req done req=0x7f90506e3180 
>> op status=-2300 http_status=500 ======
>> 2017-10-06 22:41:19.548790 7f90506e9700 1 civetweb: 0x55766d111000: 
>> $MY_IP_HERE$ - - [06/Oct/2017:22:33:47 -0400] "PUT /
>> $REDACTED_BUCKET_NAME$/$REDACTED_KEY_NAME$ HTTP/1.1" 1 0 - Boto3/1.4.7 
>> Python/2.7.12 Linux/4.9.43-17.39.amzn1.x86_64 exec-env/AWS_Lambda_python2.7 
>> Botocore/1.7.2 Resource
>> [.. slightly later in the logs..]
>> 2017-10-06 22:41:53.516272 7f90406c9700 1 rgw realm reloader: Frontends 
>> paused
>> 2017-10-06 22:41:53.528703 7f907893f700 0 ERROR: failed to clone shard, 
>> completion_mgr.get_next() returned ret=-125
>> 2017-10-06 22:44:32.049564 7f9074136700 0 ERROR: keystone revocation 
>> processing returned error r=-22
>> 2017-10-06 22:59:32.059222 7f9074136700 0 ERROR: keystone revocation 
>> processing returned error r=-22
>> 
>> Can anyone advise on the best path forward to clear the current resharding 
>> state and avoid this in the future?
>> 
> 
> What does 'radosgw-admin reshard status --bucket=<bucket>' return?
> I think just manually resharding the buckets should clear this flag;
> is that not an option?
> manual reshard: radosgw-admin bucket reshard --bucket=<bucket>
> --num-shards=<num>
> 
> also, the 'radosgw-admin bucket check --fix' might clear that flag.
> 
> For some reason it seems that the reshard cancellation code is not
> clearing that flag on the bucket index header (pretty sure it used to
> do it at one point). I'll open a tracker ticket.
> 
> Thanks,
> Yehuda
> 
>> 
>> Some other details:
>> - 3 rgw instances
>> - Ceph Luminous 12.2.1
>> - 584 active OSDs, rgw bucket index is on Intel NVMe OSDs
>> 
>> 
>> Thanks,
>> Ryan Leimenstoll
>> [email protected]
>> University of Maryland Institute for Advanced Computer Studies
>> 

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
