Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-07-11 Thread Enrico Kern
I changed the endpoints to bypass the load balancers for sync, but the
problem still remains. I will probably re-create the bucket and copy the data
back in to see if that changes anything. I can't make anything out of all the
log messages yet; I need to dig deeper into that.

On Sun, Jul 8, 2018 at 4:55 PM Enrico Kern  wrote:

> Hello,
>
> yes, we are using haproxy on the secondary zone, but A10 hardware load
> balancers on the master zone. So I suspect some timeouts may be causing this
> issue, if having load balancers in front of the gateways is the problem?
>
> I will check whether switching to the IPs directly fixes the issue.
>
> On Sun, Jul 8, 2018 at 11:51 AM Orit Wasserman 
> wrote:
>
>> Hi Enrico,
>>
>> On Fri, Jun 29, 2018 at 7:50 PM Enrico Kern 
>> wrote:
>>
>>> hmm, that also pops up right away when I restart all radosgw instances,
>>> but I will check further and see if I can find something. Maybe I will do
>>> the upgrade to mimic too.
>>>
>>> That bucket is basically under load on the master zone all the time, as
>>> we use it as historical storage for Druid, so data is constantly being
>>> written to it. I just don't get why disabling/enabling sync on the bucket
>>> syncs everything flawlessly, while if I just keep it enabled it stops
>>> syncing altogether. For the last few days I have been running
>>> disabling/enabling for the bucket in a while loop at a 30-minute interval,
>>> but that's no persistent fix ;)
>>>
>>>
>> Are you using Haproxy? We have seen syncs stall with it.
>> The simplest workaround is to configure the radosgw addresses as the
>> sync endpoints instead of the haproxy addresses.
>>
>> Regards,
>> Orit
>>
>>
>>
>>> On Fri, Jun 29, 2018 at 6:15 PM Yehuda Sadeh-Weinraub 
>>> wrote:
>>>


 On Fri, Jun 29, 2018 at 8:48 AM, Enrico Kern 
 wrote:

> also when I try to sync the bucket manually I get this error:
>
> ERROR: sync.run() returned ret=-16
> 2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to
> read sync status for
> bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1
>
> it works flawlessly with all other buckets.
>

 error 16 is EBUSY: meaning it can't take a lease to do work on the
 bucket. This usually happens when another entity (e.g., a running radosgw
 process) is working on it at the same time. Either something took the lease
 and never gave it back (leases shouldn't be indefinite, usually are being
 taken for a short period but are renewed periodically), or there might be
 some other bug related to the lease itself. I would start by first figuring
 out whether it's the first case or the second one. On the messenger log
 there should be a message prior to that that shows the operation that got
 the -16 as a response (should have something like "...=-16 (Device or
 resource busy)" in it). The same line would also contain the name of the
 rados object that is used to manage the lease. Try to look at the running
 radosgw log at the same time when this happens, and check whether there are
 other operations on that object.
 One thing to note is that if you run a sync on a bucket and stop it
 uncleanly in the middle (e.g., like killing the process), the lease will
 stay locked for a period of time (something in the order of 1 to 2
 minutes).

 Yehuda

>
>
> On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern 
> wrote:
>
>> Hello,
>>
>> thanks for the reply.
>>
>> We have around 200k objects in the bucket. It is not automatically
>> resharded (is that even supported in multisite?).
>>
>> When I run a complete data sync with debug logging, after a while I see a
>> lot of messages saying that it is unable to perform some log operation, and
>> also a lot of "device or resource busy" errors (against a lot of different
>> OSDs; restarting the OSDs does not make this error go away):
>>
>>
>> 018-06-29 15:18:30.391085 7f38bf882cc0 20
>> cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't 
>> lock
>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
>> retcode=-16
>>
>> 2018-06-29 15:18:30.391094 7f38bf882cc0 20
>> cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't 
>> lock
>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
>> retcode=-16
>>
>> 2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
>> <== osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
>> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
>> v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) 
>> v8
>>  209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380
>>
>> 2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
>> <== osd.43 

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-07-08 Thread Enrico Kern
Hello,

yes, we are using haproxy on the secondary zone, but A10 hardware load
balancers on the master zone. So I suspect some timeouts may be causing this
issue, if having load balancers in front of the gateways is the problem?

I will check whether switching to the IPs directly fixes the issue.

On Sun, Jul 8, 2018 at 11:51 AM Orit Wasserman  wrote:

> Hi Enrico,
>
> On Fri, Jun 29, 2018 at 7:50 PM Enrico Kern 
> wrote:
>
>> hmm, that also pops up right away when I restart all radosgw instances,
>> but I will check further and see if I can find something. Maybe I will do
>> the upgrade to mimic too.
>>
>> That bucket is basically under load on the master zone all the time, as
>> we use it as historical storage for Druid, so data is constantly being
>> written to it. I just don't get why disabling/enabling sync on the bucket
>> syncs everything flawlessly, while if I just keep it enabled it stops
>> syncing altogether. For the last few days I have been running
>> disabling/enabling for the bucket in a while loop at a 30-minute interval,
>> but that's no persistent fix ;)
>>
>>
> Are you using Haproxy? We have seen syncs stall with it.
> The simplest workaround is to configure the radosgw addresses as the
> sync endpoints instead of the haproxy addresses.
>
> Regards,
> Orit
>
>
>
>> On Fri, Jun 29, 2018 at 6:15 PM Yehuda Sadeh-Weinraub 
>> wrote:
>>
>>>
>>>
>>> On Fri, Jun 29, 2018 at 8:48 AM, Enrico Kern 
>>> wrote:
>>>
 also when i try to sync the bucket manual i get this error:

 ERROR: sync.run() returned ret=-16
 2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to
 read sync status for
 bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1

 it works flawless with all other buckets.

>>>
>>> error 16 is EBUSY: meaning it can't take a lease to do work on the
>>> bucket. This usually happens when another entity (e.g., a running radosgw
>>> process) is working on it at the same time. Either something took the lease
>>> and never gave it back (leases shouldn't be indefinite, usually are being
>>> taken for a short period but are renewed periodically), or there might be
>>> some other bug related to the lease itself. I would start by first figuring
>>> out whether it's the first case or the second one. On the messenger log
>>> there should be a message prior to that that shows the operation that got
>>> the -16 as a response (should have something like "...=-16 (Device or
>>> resource busy)" in it). The same line would also contain the name of the
>>> rados object that is used to manage the lease. Try to look at the running
>>> radosgw log at the same time when this happens, and check whether there are
>>> other operations on that object.
>>> One thing to note is that if you run a sync on a bucket and stop it
>>> uncleanly in the middle (e.g., like killing the process), the lease will
>>> stay locked for a period of time (something in the order of 1 to 2 minutes).
>>>
>>> Yehuda
>>>


 On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern 
 wrote:

> Hello,
>
> thanks for the reply.
>
> We have around 200k objects in the bucket. It is not automatic
> resharded (is that even supported in multisite?)
>
> What i see when i run a complete data sync with the debug logs after a
> while i see alot of informations that it is unable to perform some log and
> also some device or resource busy (also with alot of different osds,
> restarting the osds also doesnt make this error going away):
>
>
> 018-06-29 15:18:30.391085 7f38bf882cc0 20
> cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't 
> lock
> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
> retcode=-16
>
> 2018-06-29 15:18:30.391094 7f38bf882cc0 20
> cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't 
> lock
> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
> retcode=-16
>
> 2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
> <== osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
> v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
>  209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380
>
> 2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
> <== osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105 [call]
> v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource busy)) v8
>  210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380
>
>
> There are no issues with the OSDs all other stuff in the cluster works
> (rbd, images to openstack etc.)
>
>
> Also that command with appending debug 

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-07-08 Thread Orit Wasserman
Hi Enrico,

On Fri, Jun 29, 2018 at 7:50 PM Enrico Kern  wrote:

> hmm, that also pops up right away when I restart all radosgw instances,
> but I will check further and see if I can find something. Maybe I will do
> the upgrade to mimic too.
>
> That bucket is basically under load on the master zone all the time, as
> we use it as historical storage for Druid, so data is constantly being
> written to it. I just don't get why disabling/enabling sync on the bucket
> syncs everything flawlessly, while if I just keep it enabled it stops
> syncing altogether. For the last few days I have been running
> disabling/enabling for the bucket in a while loop at a 30-minute interval,
> but that's no persistent fix ;)
>
>
Are you using Haproxy? We have seen syncs stall with it.
The simplest workaround is to configure the radosgw addresses as the
sync endpoints instead of the haproxy addresses.
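For example, something along these lines should do it (a rough sketch only:
which zone(s) you modify depends on which endpoints currently point at the
load balancers, and the hostnames/ports here are placeholders):

$ radosgw-admin zone modify --rgw-zone=amsterdam \
    --endpoints=http://rgw1.example.com:8080,http://rgw2.example.com:8080
$ radosgw-admin period update --commit
# restarting the radosgw instances afterwards makes sure they pick up the new period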

Regards,
Orit



> On Fri, Jun 29, 2018 at 6:15 PM Yehuda Sadeh-Weinraub 
> wrote:
>
>>
>>
>> On Fri, Jun 29, 2018 at 8:48 AM, Enrico Kern 
>> wrote:
>>
>>> also when i try to sync the bucket manual i get this error:
>>>
>>> ERROR: sync.run() returned ret=-16
>>> 2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to
>>> read sync status for
>>> bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1
>>>
>>> it works flawless with all other buckets.
>>>
>>
>> error 16 is EBUSY: meaning it can't take a lease to do work on the
>> bucket. This usually happens when another entity (e.g., a running radosgw
>> process) is working on it at the same time. Either something took the lease
>> and never gave it back (leases shouldn't be indefinite, usually are being
>> taken for a short period but are renewed periodically), or there might be
>> some other bug related to the lease itself. I would start by first figuring
>> out whether it's the first case or the second one. On the messenger log
>> there should be a message prior to that that shows the operation that got
>> the -16 as a response (should have something like "...=-16 (Device or
>> resource busy)" in it). The same line would also contain the name of the
>> rados object that is used to manage the lease. Try to look at the running
>> radosgw log at the same time when this happens, and check whether there are
>> other operations on that object.
>> One thing to note is that if you run a sync on a bucket and stop it
>> uncleanly in the middle (e.g., like killing the process), the lease will
>> stay locked for a period of time (something in the order of 1 to 2 minutes).
>>
>> Yehuda
>>
>>>
>>>
>>> On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern 
>>> wrote:
>>>
 Hello,

 thanks for the reply.

 We have around 200k objects in the bucket. It is not automatic
 resharded (is that even supported in multisite?)

 What i see when i run a complete data sync with the debug logs after a
 while i see alot of informations that it is unable to perform some log and
 also some device or resource busy (also with alot of different osds,
 restarting the osds also doesnt make this error going away):


 018-06-29 15:18:30.391085 7f38bf882cc0 20
 cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't lock
 amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
 retcode=-16

 2018-06-29 15:18:30.391094 7f38bf882cc0 20
 cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't lock
 amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
 retcode=-16

 2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
 <== osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
 datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
 v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
  209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380

 2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
 <== osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
 datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105 [call]
 v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource busy)) v8
  210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380


 There are no issues with the OSDs all other stuff in the cluster works
 (rbd, images to openstack etc.)


 Also that command with appending debug never finishes.

 On Tue, Jun 26, 2018 at 5:45 PM Yehuda Sadeh-Weinraub <
 yeh...@redhat.com> wrote:

>
>
> On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern <
> enrico.k...@glispamedia.com> wrote:
>
>> Hello,
>>
>> We have two ceph luminous clusters (12.2.5).
>>
>> recently one of our big buckets stopped syncing properly. We have a
>> one specific bucket which is around 30TB in size consisting of alot of
>> directories 

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-29 Thread Enrico Kern
hmm, that also pops up right away when I restart all radosgw instances, but
I will check further and see if I can find something. Maybe I will do the
upgrade to mimic too.

That bucket is basically under load on the master zone all the time, as we
use it as historical storage for Druid, so data is constantly being written
to it. I just don't get why disabling/enabling sync on the bucket syncs
everything flawlessly, while if I just keep it enabled it stops syncing
altogether. For the last few days I have been running disabling/enabling for
the bucket in a while loop at a 30-minute interval, but that's no persistent
fix ;)
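For reference, the loop is nothing more sophisticated than roughly the
following (the bucket name is a placeholder):

while true; do
  # turn bucket sync off and back on, which makes the bucket get picked up
  # for sync again
  radosgw-admin bucket sync disable --bucket=bucketname
  sleep 60
  radosgw-admin bucket sync enable --bucket=bucketname
  # wait 30 minutes before kicking it again
  sleep 1800
done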

On Fri, Jun 29, 2018 at 6:15 PM Yehuda Sadeh-Weinraub 
wrote:

>
>
> On Fri, Jun 29, 2018 at 8:48 AM, Enrico Kern 
> wrote:
>
>> also when i try to sync the bucket manual i get this error:
>>
>> ERROR: sync.run() returned ret=-16
>> 2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to
>> read sync status for
>> bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1
>>
>> it works flawless with all other buckets.
>>
>
> error 16 is EBUSY: meaning it can't take a lease to do work on the bucket.
> This usually happens when another entity (e.g., a running radosgw process)
> is working on it at the same time. Either something took the lease and
> never gave it back (leases shouldn't be indefinite, usually are being taken
> for a short period but are renewed periodically), or there might be some
> other bug related to the lease itself. I would start by first figuring out
> whether it's the first case or the second one. On the messenger log there
> should be a message prior to that that shows the operation that got the -16
> as a response (should have something like "...=-16 (Device or resource
> busy)" in it). The same line would also contain the name of the rados
> object that is used to manage the lease. Try to look at the running radosgw
> log at the same time when this happens, and check whether there are other
> operations on that object.
> One thing to note is that if you run a sync on a bucket and stop it
> uncleanly in the middle (e.g., like killing the process), the lease will
> stay locked for a period of time (something in the order of 1 to 2 minutes).
>
> Yehuda
>
>>
>>
>> On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern 
>> wrote:
>>
>>> Hello,
>>>
>>> thanks for the reply.
>>>
>>> We have around 200k objects in the bucket. It is not automatic resharded
>>> (is that even supported in multisite?)
>>>
>>> What i see when i run a complete data sync with the debug logs after a
>>> while i see alot of informations that it is unable to perform some log and
>>> also some device or resource busy (also with alot of different osds,
>>> restarting the osds also doesnt make this error going away):
>>>
>>>
>>> 018-06-29 15:18:30.391085 7f38bf882cc0 20
>>> cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't lock
>>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
>>> retcode=-16
>>>
>>> 2018-06-29 15:18:30.391094 7f38bf882cc0 20
>>> cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't lock
>>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
>>> retcode=-16
>>>
>>> 2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
>>> <== osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
>>> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
>>> v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
>>>  209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380
>>>
>>> 2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
>>> <== osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
>>> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105 [call]
>>> v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource busy)) v8
>>>  210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380
>>>
>>>
>>> There are no issues with the OSDs all other stuff in the cluster works
>>> (rbd, images to openstack etc.)
>>>
>>>
>>> Also that command with appending debug never finishes.
>>>
>>> On Tue, Jun 26, 2018 at 5:45 PM Yehuda Sadeh-Weinraub 
>>> wrote:
>>>


 On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern <
 enrico.k...@glispamedia.com> wrote:

> Hello,
>
> We have two ceph luminous clusters (12.2.5).
>
> recently one of our big buckets stopped syncing properly. We have a
> one specific bucket which is around 30TB in size consisting of alot of
> directories with each one having files of 10-20MB.
>
> The secondary zone is often completly missing multiple days of data in
> this bucket, while all other smaller buckets sync just fine.
>
> Even with the complete data missing radosgw-admin sync status always
> says everything is fine.
>
> the sync error log doesnt show anything for those days.
>
> Running

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-29 Thread Yehuda Sadeh-Weinraub
On Fri, Jun 29, 2018 at 8:48 AM, Enrico Kern  wrote:

> also when i try to sync the bucket manual i get this error:
>
> ERROR: sync.run() returned ret=-16
> 2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to
> read sync status for bucketname:6a9448d2-bdba-4bec-
> aad6-aba72cd8eac6.27150814.1
>
> it works flawless with all other buckets.
>

error 16 is EBUSY: meaning it can't take a lease to do work on the bucket.
This usually happens when another entity (e.g., a running radosgw process)
is working on it at the same time. Either something took the lease and
never gave it back (leases shouldn't be indefinite, usually are being taken
for a short period but are renewed periodically), or there might be some
other bug related to the lease itself. I would start by first figuring out
whether it's the first case or the second one. On the messenger log there
should be a message prior to that that shows the operation that got the -16
as a response (should have something like "...=-16 (Device or resource
busy)" in it). The same line would also contain the name of the rados
object that is used to manage the lease. Try to look at the running radosgw
log at the same time when this happens, and check whether there are other
operations on that object.
One thing to note is that if you run a sync on a bucket and stop it
uncleanly in the middle (e.g., like killing the process), the lease will
stay locked for a period of time (something in the order of 1 to 2 minutes).
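As a rough illustration of that check (the log path here is just an example of
wherever your radosgw debug output lands; the pool, object and lock names are
the ones from your earlier output):

$ grep 'Device or resource busy' /var/log/ceph/ceph-rgw.log | grep osd_op_reply
# the matching osd_op_reply lines contain the rados object the lease is taken
# on, e.g. datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52
# if the rados CLI advisory-lock commands are available, the current holder of
# the sync_lock on that object can be shown with:
$ rados -p amsterdam.rgw.log lock info \
    datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 sync_lock

If the lock is held by one of the running radosgw instances, that points at the
first case; if nothing is holding it and you still get EBUSY, the lease handling
itself is more suspect.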

Yehuda

>
>
> On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern 
> wrote:
>
>> Hello,
>>
>> thanks for the reply.
>>
>> We have around 200k objects in the bucket. It is not automatic resharded
>> (is that even supported in multisite?)
>>
>> What i see when i run a complete data sync with the debug logs after a
>> while i see alot of informations that it is unable to perform some log and
>> also some device or resource busy (also with alot of different osds,
>> restarting the osds also doesnt make this error going away):
>>
>>
>> 018-06-29 15:18:30.391085 7f38bf882cc0 20 cr:s=0x55de55700b20:op=
>> 0x55de55717010:20RGWContinuousLeaseCR: couldn't lock
>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-
>> bdba-4bec-aad6-aba72cd8eac6.59:sync_lock: retcode=-16
>>
>> 2018-06-29 15:18:30.391094 7f38bf882cc0 20 cr:s=0x55de55732750:op=
>> 0x55de5572d970:20RGWContinuousLeaseCR: couldn't lock
>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-
>> bdba-4bec-aad6-aba72cd8eac6.10:sync_lock: retcode=-16
>>
>> 2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
>> <== osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
>> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
>> v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
>>  209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380
>>
>> 2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
>> <== osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
>> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105
>> [call] v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource
>> busy)) v8  210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380
>>
>>
>> There are no issues with the OSDs all other stuff in the cluster works
>> (rbd, images to openstack etc.)
>>
>>
>> Also that command with appending debug never finishes.
>>
>> On Tue, Jun 26, 2018 at 5:45 PM Yehuda Sadeh-Weinraub 
>> wrote:
>>
>>>
>>>
>>> On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern <
>>> enrico.k...@glispamedia.com> wrote:
>>>
 Hello,

 We have two ceph luminous clusters (12.2.5).

 recently one of our big buckets stopped syncing properly. We have a one
 specific bucket which is around 30TB in size consisting of alot of
 directories with each one having files of 10-20MB.

 The secondary zone is often completly missing multiple days of data in
 this bucket, while all other smaller buckets sync just fine.

 Even with the complete data missing radosgw-admin sync status always
 says everything is fine.

 the sync error log doesnt show anything for those days.

 Running

 radosgw-admin metadata sync and data sync also doesnt solve the issue.
 The only way of making it sync again is to disable and re-eanble the sync.
 That needs to be done as often as like 10 times in an hour to make it sync
 properly.

 radosgw-admin bucket sync disable
 radosgw-admin bucket sync enable

 when i run data init i sometimes get this:

  radosgw-admin data sync init --source-zone berlin
 2018-06-24 07:55:46.337858 7fe7557fa700  0 ERROR: failed to distribute
 cache for amsterdam.rgw.log:datalog.sync-status.6a9448d2-bdba-
 4bec-aad6-aba72cd8eac6

 Sometimes when really alot of data is missing (yesterday it was more
 then 1 month) this helps making them get in sync again when run on the
 secondary 

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-29 Thread Enrico Kern
also when I try to sync the bucket manually I get this error:

ERROR: sync.run() returned ret=-16
2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to read
sync status for bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1

it works flawlessly with all other buckets.


On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern  wrote:

> Hello,
>
> thanks for the reply.
>
> We have around 200k objects in the bucket. It is not automatic resharded
> (is that even supported in multisite?)
>
> What i see when i run a complete data sync with the debug logs after a
> while i see alot of informations that it is unable to perform some log and
> also some device or resource busy (also with alot of different osds,
> restarting the osds also doesnt make this error going away):
>
>
> 018-06-29 15:18:30.391085 7f38bf882cc0 20
> cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't lock
> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
> retcode=-16
>
> 2018-06-29 15:18:30.391094 7f38bf882cc0 20
> cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't lock
> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
> retcode=-16
>
> 2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604 <==
> osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
> v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
>  209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380
>
> 2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604 <==
> osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105 [call]
> v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource busy)) v8
>  210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380
>
>
> There are no issues with the OSDs all other stuff in the cluster works
> (rbd, images to openstack etc.)
>
>
> Also that command with appending debug never finishes.
>
> On Tue, Jun 26, 2018 at 5:45 PM Yehuda Sadeh-Weinraub 
> wrote:
>
>>
>>
>> On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern <
>> enrico.k...@glispamedia.com> wrote:
>>
>>> Hello,
>>>
>>> We have two ceph luminous clusters (12.2.5).
>>>
>>> recently one of our big buckets stopped syncing properly. We have a one
>>> specific bucket which is around 30TB in size consisting of alot of
>>> directories with each one having files of 10-20MB.
>>>
>>> The secondary zone is often completly missing multiple days of data in
>>> this bucket, while all other smaller buckets sync just fine.
>>>
>>> Even with the complete data missing radosgw-admin sync status always
>>> says everything is fine.
>>>
>>> the sync error log doesnt show anything for those days.
>>>
>>> Running
>>>
>>> radosgw-admin metadata sync and data sync also doesnt solve the issue.
>>> The only way of making it sync again is to disable and re-eanble the sync.
>>> That needs to be done as often as like 10 times in an hour to make it sync
>>> properly.
>>>
>>> radosgw-admin bucket sync disable
>>> radosgw-admin bucket sync enable
>>>
>>> when i run data init i sometimes get this:
>>>
>>>  radosgw-admin data sync init --source-zone berlin
>>> 2018-06-24 07:55:46.337858 7fe7557fa700  0 ERROR: failed to distribute
>>> cache for
>>> amsterdam.rgw.log:datalog.sync-status.6a9448d2-bdba-4bec-aad6-aba72cd8eac6
>>>
>>> Sometimes when really alot of data is missing (yesterday it was more
>>> then 1 month) this helps making them get in sync again when run on the
>>> secondary zone:
>>>
>>> radosgw-admin bucket check --fix --check-objects
>>>
>>> how can i debug that problem further? We have so many requests on the
>>> cluster that is is hard to dig something out of the log files..
>>>
>>> Given all the smaller buckets are perfectly in sync i suspect some
>>> problem because of the size of the bucket.
>>>
>>
>> How many objects in the bucket? Is it getting automatically resharded?
>>
>>
>>>
>>> Any points to the right direction are greatly appreciated.
>>>
>>
>> A few things to look at that might help identify the issue.
>>
>> What does this show (I think the luminous command is as follows):
>>
>> $ radosgw-admin bucket sync status --source-zone=
>>
>> You can try manually syncing the bucket, and get specific logs for that
>> operation:
>>
>> $ radosgw-admin bucket sync run --source-zone= --debug-rgw=20
>> --debug-ms=1
>>
>> And you can try getting more info from the sync trace module:
>>
>> $ ceph --admin-daemon  sync trace history
>> 
>>
>> You can also try the 'sync trace show' command.
>>
>>
>> Yehuda
>>
>>
>>
>>>
>>> Regards,
>>>
>>> Enrico
>>>
>>> --
>>>
>>> *Enrico Kern*
>>> VP IT Operations
>>>
>>> enrico.k...@glispa.com
>>> +49 (0) 30 555713017 / +49 (0) 152 26814501
>>> skype: flyersa
>>> LinkedIn Profile 
>>>

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-29 Thread Enrico Kern
Hello,

thanks for the reply.

We have around 200k objects in the bucket. It is not automatically resharded
(is that even supported in multisite?).

When I run a complete data sync with debug logging, after a while I see a lot
of messages saying that it is unable to perform some log operation, and also a
lot of "device or resource busy" errors (against a lot of different OSDs;
restarting the OSDs does not make this error go away):


018-06-29 15:18:30.391085 7f38bf882cc0 20
cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't lock
amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
retcode=-16

2018-06-29 15:18:30.391094 7f38bf882cc0 20
cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't lock
amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
retcode=-16

2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604 <==
osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
 209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380

2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604 <==
osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105 [call]
v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource busy)) v8
 210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380


There are no issues with the OSDs; everything else in the cluster works fine
(RBD, images for OpenStack, etc.).


Also, that command with the debug options appended never finishes.

On Tue, Jun 26, 2018 at 5:45 PM Yehuda Sadeh-Weinraub 
wrote:

>
>
> On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern  > wrote:
>
>> Hello,
>>
>> We have two ceph luminous clusters (12.2.5).
>>
>> recently one of our big buckets stopped syncing properly. We have a one
>> specific bucket which is around 30TB in size consisting of alot of
>> directories with each one having files of 10-20MB.
>>
>> The secondary zone is often completly missing multiple days of data in
>> this bucket, while all other smaller buckets sync just fine.
>>
>> Even with the complete data missing radosgw-admin sync status always says
>> everything is fine.
>>
>> the sync error log doesnt show anything for those days.
>>
>> Running
>>
>> radosgw-admin metadata sync and data sync also doesnt solve the issue.
>> The only way of making it sync again is to disable and re-eanble the sync.
>> That needs to be done as often as like 10 times in an hour to make it sync
>> properly.
>>
>> radosgw-admin bucket sync disable
>> radosgw-admin bucket sync enable
>>
>> when i run data init i sometimes get this:
>>
>>  radosgw-admin data sync init --source-zone berlin
>> 2018-06-24 07:55:46.337858 7fe7557fa700  0 ERROR: failed to distribute
>> cache for
>> amsterdam.rgw.log:datalog.sync-status.6a9448d2-bdba-4bec-aad6-aba72cd8eac6
>>
>> Sometimes when really alot of data is missing (yesterday it was more then
>> 1 month) this helps making them get in sync again when run on the secondary
>> zone:
>>
>> radosgw-admin bucket check --fix --check-objects
>>
>> how can i debug that problem further? We have so many requests on the
>> cluster that is is hard to dig something out of the log files..
>>
>> Given all the smaller buckets are perfectly in sync i suspect some
>> problem because of the size of the bucket.
>>
>
> How many objects in the bucket? Is it getting automatically resharded?
>
>
>>
>> Any points to the right direction are greatly appreciated.
>>
>
> A few things to look at that might help identify the issue.
>
> What does this show (I think the luminous command is as follows):
>
> $ radosgw-admin bucket sync status --source-zone=
>
> You can try manually syncing the bucket, and get specific logs for that
> operation:
>
> $ radosgw-admin bucket sync run --source-zone= --debug-rgw=20
> --debug-ms=1
>
> And you can try getting more info from the sync trace module:
>
> $ ceph --admin-daemon  sync trace history
> 
>
> You can also try the 'sync trace show' command.
>
>
> Yehuda
>
>
>
>>
>> Regards,
>>
>> Enrico
>>
>> --
>>
>> *Enrico Kern*
>> VP IT Operations
>>
>> enrico.k...@glispa.com
>> +49 (0) 30 555713017 / +49 (0) 152 26814501
>> skype: flyersa
>> LinkedIn Profile 
>>
>>
>>  
>>
>> *Glispa GmbH* | Berlin Office
>> Sonnenburger Straße 73
>> 
>> 10437 Berlin
>> 

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-26 Thread Yehuda Sadeh-Weinraub
On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern 
wrote:

> Hello,
>
> We have two ceph luminous clusters (12.2.5).
>
> Recently one of our big buckets stopped syncing properly. We have one
> specific bucket which is around 30TB in size, consisting of a lot of
> directories, each containing files of 10-20MB.
>
> The secondary zone is often completely missing multiple days of data in
> this bucket, while all other, smaller buckets sync just fine.
>
> Even with that data completely missing, radosgw-admin sync status always says
> everything is fine.
>
> The sync error log doesn't show anything for those days.
>
> Running
>
> radosgw-admin metadata sync and data sync also doesn't solve the issue. The
> only way of making it sync again is to disable and re-enable the sync. That
> needs to be done as often as about 10 times an hour to make it sync
> properly:
>
> radosgw-admin bucket sync disable
> radosgw-admin bucket sync enable
>
> When I run data sync init I sometimes get this:
>
>  radosgw-admin data sync init --source-zone berlin
> 2018-06-24 07:55:46.337858 7fe7557fa700  0 ERROR: failed to distribute
> cache for amsterdam.rgw.log:datalog.sync-status.6a9448d2-bdba-
> 4bec-aad6-aba72cd8eac6
>
> Sometimes, when a really large amount of data is missing (yesterday it was
> more than one month), this helps get things back in sync again when run on
> the secondary zone:
>
> radosgw-admin bucket check --fix --check-objects
>
> How can I debug this problem further? We have so many requests on the
> cluster that it is hard to dig anything out of the log files.
>
> Given that all the smaller buckets are perfectly in sync, I suspect some
> problem related to the size of the bucket.
>

How many objects in the bucket? Is it getting automatically resharded?


>
> Any pointers in the right direction are greatly appreciated.
>

A few things to look at that might help identify the issue.

What does this show (I think the luminous command is as follows):

$ radosgw-admin bucket sync status --source-zone=

You can try manually syncing the bucket, and get specific logs for that
operation:

$ radosgw-admin bucket sync run --source-zone= --debug-rgw=20
--debug-ms=1

And you can try getting more info from the sync trace module:

$ ceph --admin-daemon  sync trace history


You can also try the 'sync trace show' command.
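To make those concrete for this setup, the invocations might look roughly like
this (the zone and bucket names are the ones from this thread; the admin socket
path is an assumption, use whatever socket your radosgw instance actually
exposes):

$ radosgw-admin bucket sync status --bucket=bucketname --source-zone=berlin
$ radosgw-admin bucket sync run --bucket=bucketname --source-zone=berlin \
    --debug-rgw=20 --debug-ms=1 2>&1 | tee /tmp/bucket-sync.log
$ ceph --admin-daemon /var/run/ceph/ceph-client.rgw.amsterdam1.asok \
    sync trace history bucketname
$ ceph --admin-daemon /var/run/ceph/ceph-client.rgw.amsterdam1.asok \
    sync trace show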


Yehuda



>
> Regards,
>
> Enrico
>
> --
>
> *Enrico Kern*
> VP IT Operations
>
> enrico.k...@glispa.com
> +49 (0) 30 555713017 / +49 (0) 152 26814501
> skype: flyersa
> LinkedIn Profile 
>
>
>  
>
> *Glispa GmbH* | Berlin Office
> Sonnenburger Straße 73
> 10437 Berlin | Germany
>
> Managing Director Din Karol-Gavish
> Registered in Berlin
> AG Charlottenburg | HRB 114678B
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com