Re: [ceph-users] RGW Dynamic bucket index resharding keeps resharding all buckets

2018-06-18 Thread Sander van Schie / True
Thanks, I created the following issue: https://tracker.ceph.com/issues/24551

Sander


Re: [ceph-users] RGW Dynamic bucket index resharding keeps resharding all buckets

2018-06-18 Thread Sander van Schie / True
While Ceph was resharding buckets over and over again, the maximum available 
storage reported by 'ceph df' also decreased by about 20% while usage stayed the 
same; we have yet to find out where the missing storage went. The decrease 
stopped once we disabled resharding.
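
For reference, disabling it came down to something along these lines (the rgw 
client section name below is only an example from our setup and will differ per 
deployment), followed by a restart of the gateways:

# ceph.conf on the RGW nodes -- section name is deployment-specific
[client.rgw.gateway1]
rgw_dynamic_resharding = false

# roughly what we kept an eye on while the reshards were running
ceph df detail
radosgw-admin reshard list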

Any help would be greatly appreciated.

Thanks,

Sander 


From: Sander van Schie / True
Sent: Friday, June 15, 2018 14:19
To: ceph-users@lists.ceph.com
Subject: RGW Dynamic bucket index resharding keeps resharding all buckets

Hello,

We're running into some problems with dynamic bucket index resharding. After an 
upgrade from Ceph 12.2.2 to 12.2.5, which fixed an issue with resharding when 
using tenants (which we do), the cluster was busy resharding for two days 
straight, resharding the same buckets over and over again.

After disabling it and re-enabling it a while later, it resharded all buckets 
again and then kept quiet for a bit. Later on it started resharding buckets 
over and over again, even buckets which didn't have any data added in the 
meantime. The reshard list always says 'old_num_shards: 1' for every bucket, 
even though I can confirm with 'bucket stats' that the desired number of shards 
is already present. It looks like the background process which scans buckets 
doesn't properly recognize the number of shards a bucket currently has. When I 
manually add a reshard job, it does properly recognize the current number of 
shards.

As a side note, we had two buckets in the reshard list which were removed a long 
time ago. We were unable to cancel the reshard jobs for those buckets. After 
recreating the users and buckets we were able to remove them from the list, so 
they are no longer present. Probably not relevant, but you never know.

Are we missing something, or are we running into a bug?

Thanks,

Sander


[ceph-users] RGW Dynamic bucket index resharding keeps resharding all buckets

2018-06-15 Thread Sander van Schie / True
Hello,

We're running into some problems with dynamic bucket index resharding. After an 
upgrade from Ceph 12.2.2 to 12.2.5, which fixed an issue with resharding when 
using tenants (which we do), the cluster was busy resharding for two days 
straight, resharding the same buckets over and over again.

After disabling it and re-enabling it a while later, it resharded all buckets 
again and then kept quiet for a bit. Later on it started resharding buckets 
over and over again, even buckets which didn't have any data added in the 
meantime. The reshard list always says 'old_num_shards: 1' for every bucket, 
even though I can confirm with 'bucket stats' that the desired number of shards 
is already present. It looks like the background process which scans buckets 
doesn't properly recognize the number of shards a bucket currently has. When I 
manually add a reshard job, it does properly recognize the current number of 
shards.
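
For what it's worth, these are the commands involved; the bucket name below is a 
placeholder, the real ones are tenant-qualified:

# background reshard queue -- this keeps listing old_num_shards: 1
radosgw-admin reshard list

# current shard count according to the bucket itself
radosgw-admin bucket stats --bucket="<tenant>/<bucket>"

# manually queueing a reshard, which does pick up the current shard count
radosgw-admin reshard add --bucket="<tenant>/<bucket>" --num-shards=<n>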

As a side note, we had two buckets in the reshard list which were removed a long 
time ago. We were unable to cancel the reshard jobs for those buckets. After 
recreating the users and buckets we were able to remove them from the list, so 
they are no longer present. Probably not relevant, but you never know.
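
The cancel we tried was along the lines of (placeholder bucket name again):

radosgw-admin reshard cancel --bucket="<tenant>/<bucket>"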

Are we missing something, or are we running into a bug?

Thanks,

Sander


Re: [ceph-users] Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5

2018-06-14 Thread Sander van Schie / True
Awesome, thanks for the quick replies and insights.


It seems like this is the issue you're talking about:

http://tracker.ceph.com/issues/22769, with the fix set to be released in v12.2.6.


We'll focus on investigating the resharding of buckets; hopefully that will 
solve the issue for us.


Sander



From: Gregory Farnum 
Sent: Thursday, June 14, 2018 22:45
To: Sander van Schie / True
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues with deep-scrub since upgrading 
from v12.2.2 to v12.2.5

Yes. Deep scrub of a bucket index pool requires reading all the omap keys, and 
the rgw bucket indices can get quite large. The OSD will limit the number of 
keys it reads at a time to try and avoid overwhelming things.
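
If you want a feel for the key counts involved, you can list them on one of the 
index objects directly; the pool name below assumes the default naming, and the 
.dir.* objects are the per-bucket index shards:

# find the index objects, then count the omap keys on one of them
rados -p default.rgw.buckets.index ls | head
rados -p default.rgw.buckets.index listomapkeys .dir.<bucket_id> | wc -l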

We backported to luminous (but after the 12.2.5 release, it looks like) a 
commit that would let client requests preempt the scrubbing reads, which would 
make it restart several times if the bucket index keeps getting written to. I 
think there may also have been a bug fix where the OSD had been able to issue 
client access during some scrubs that could result in issues, but I'm not sure 
when that backport would have happened.

I'd try and figure out why the RGW resharding wasn't working properly though, 
as preventing resharding when it wants to is inviting issues with the 
underlying RADOS objects.
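
The per-bucket view the resharder has can be inspected with something like 
(bucket name is a placeholder):

radosgw-admin reshard status --bucket="<tenant>/<bucket>"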
-Greg


Re: [ceph-users] Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5

2018-06-14 Thread Sander van Schie / True
Thank you for your reply.


I'm not sure if this is the case, since we have a rather small cluster and the 
PGs have at most just over 10k objects (the total object count in the cluster is 
about 9 million). During the 10-minute scrubs we're seeing a steady 10k IOPS on 
the underlying block device of the OSD (which are enterprise SSDs), both on the 
primary OSD and on a secondary OSD. It's all read IOPS and throughput is about 
65 MiB/s. I'm not very familiar with the deep-scrub process, but this seems a 
bit much to me. Can this still be intended behaviour? It would mean it's only 
able to check around 15-20 objects a second with SSD OSDs while doing 8k IOPS. 
The strange thing is that we didn't see this happen at all before the upgrade; 
it started right after.


I also checked the PGs whose deep-scrub finished in a couple of seconds; most of 
those have around 5k objects. The PGs for which deep-scrub is causing issues all 
seem to be part of the RGW bucket index pool.
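
The mapping from PG to pool is just the number before the dot in the PG id, so 
for the slow 14.1 for example:

# pool id 14 -> pool name
ceph osd lspools
# list the PGs belonging to that pool
ceph pg dump pgs_brief | grep '^14\.'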


Since we're using tenants for RGW, dynamic bucket index resharding didn't work 
before the Ceph update (http://tracker.ceph.com/issues/22046). After the update 
it was hammering the cluster quite hard, doing about 30-60k write IOPS on the 
RGW index pool for two days straight. The reshard list also kept showing almost 
completely different data every few seconds. Since this was affecting 
performance as well, we temporarily disabled it. Could this somehow be related?
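
The write load and the churn in the queue can be seen with, for example (pool 
name assumes the default naming):

# per-pool client I/O -- the index pool stood out
ceph osd pool stats default.rgw.buckets.index
# the queue kept changing between runs
watch -n 5 radosgw-admin reshard list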


Thanks


Sander




From: Gregory Farnum 
Sent: Thursday, June 14, 2018 19:45
To: Sander van Schie / True
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues with deep-scrub since upgrading 
from v12.2.2 to v12.2.5

Deep scrub needs to read every object in the PG. If some PGs are only taking 5 
seconds they must be nearly empty (or maybe they only contain objects with 
small amounts of omap or something). Ten minutes is perfectly reasonable, but 
it is an added load on the cluster as it does all those object reads. Perhaps 
your configured scrub rates are using enough IOPS that you don't have enough 
left for your client workloads.
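
The usual knobs to look at are the scrub throttles, for example:

# inspect current values (run on the node hosting osd.3)
ceph daemon osd.3 config show | grep -E 'osd_max_scrubs|osd_scrub_sleep'
# slow scrub reads down cluster-wide -- the value is just an example
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'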
-Greg
On Thu, Jun 14, 2018 at 11:37 AM Sander van Schie / True 
<sander.vansc...@true.nl> wrote:
Hello,

We recently upgraded Ceph from version 12.2.2 to version 12.2.5. Since the 
upgrade we've been having performance issues which seem to be related to when 
deep-scrub actions are performed.

Most of the time deep-scrub actions only take a couple of seconds at most; 
however, occasionally one takes 10 minutes. It's either a few seconds or 10 
minutes, never a couple of minutes or anything else in between. This has been 
happening since the upgrade.

For example see the following:

2018-06-14 10:11:46.337086 7fbd3528b700  0 log_channel(cluster) log [DBG] : 
15.2dc deep-scrub starts
2018-06-14 10:11:50.947843 7fbd3528b700  0 log_channel(cluster) log [DBG] : 
15.2dc deep-scrub ok

2018-06-14 10:45:49.575042 7fbd32a86700  0 log_channel(cluster) log [DBG] : 
14.1 deep-scrub starts
2018-06-14 10:55:53.326309 7fbd32a86700  0 log_channel(cluster) log [DBG] : 
14.1 deep-scrub ok

2018-06-14 10:58:28.652360 7fbd33a88700  0 log_channel(cluster) log [DBG] : 
15.5f deep-scrub starts
2018-06-14 10:58:33.411769 7fbd2fa80700  0 log_channel(cluster) log [DBG] : 
15.5f deep-scrub ok

The scrub on PG 14.1 took pretty much exactly 10 minutes, while the others took 
only about 5 seconds. That matches the value of "osd scrub finalize thread 
timeout", which is currently set to 10 minutes, but I'm not sure if it's related 
or just a coincidence. It's not just this PG; there's a bunch of them, also on 
different nodes and OSDs.

PG dump for this problematic PG is as follows:

PG_STAT:             14.1
OBJECTS:             10573
MISSING_ON_PRIMARY:  0
DEGRADED:            0
MISPLACED:           0
UNFOUND:             0
BYTES:               0
LOG:                 1579
DISK_LOG:            1579
STATE:               active+clean
STATE_STAMP:         2018-06-14 15:47:32.83
VERSION:             1215'1291261
REPORTED:            1215:7062174
UP:                  [3,8,20]
UP_PRIMARY:          3
ACTING:              [3,8,20]
ACTING_PRIMARY:      3
LAST_SCRUB:          1179'1288697
SCRUB_STAMP:         2018-06-14 10:55:53.326320
LAST_DEEP_SCRUB:     1179'1288697
DEEP_SCRUB_STAMP:    2018-06-14 10:55:53.326320
SNAPTRIMQ_LEN:       0

During the longer running deep-scrub actions we're also running into 
performance problems.

Any idea what's going wrong?

Thanks

Sander


[ceph-users] Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5

2018-06-14 Thread Sander van Schie / True
Hello,

We recently upgraded Ceph from version 12.2.2 to version 12.2.5. Since the 
upgrade we've been having performance issues which seem to be related to when 
deep-scrub actions are performed.

Most of the time deep-scrub actions only take a couple of seconds at most; 
however, occasionally one takes 10 minutes. It's either a few seconds or 10 
minutes, never a couple of minutes or anything else in between. This has been 
happening since the upgrade.

For example see the following:

2018-06-14 10:11:46.337086 7fbd3528b700  0 log_channel(cluster) log [DBG] : 
15.2dc deep-scrub starts
2018-06-14 10:11:50.947843 7fbd3528b700  0 log_channel(cluster) log [DBG] : 
15.2dc deep-scrub ok

2018-06-14 10:45:49.575042 7fbd32a86700  0 log_channel(cluster) log [DBG] : 
14.1 deep-scrub starts
2018-06-14 10:55:53.326309 7fbd32a86700  0 log_channel(cluster) log [DBG] : 
14.1 deep-scrub ok

2018-06-14 10:58:28.652360 7fbd33a88700  0 log_channel(cluster) log [DBG] : 
15.5f deep-scrub starts
2018-06-14 10:58:33.411769 7fbd2fa80700  0 log_channel(cluster) log [DBG] : 
15.5f deep-scrub ok
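
These lines are straight from the OSD log; assuming the default log location, 
the start/finish pairs can be pulled out with something like:

grep -E 'deep-scrub (starts|ok)' /var/log/ceph/ceph-osd.*.log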

The scrub on PG 14.1 took pretty much exactly 10 minutes, while the others took 
only about 5 seconds. That matches the value of "osd scrub finalize thread 
timeout", which is currently set to 10 minutes, but I'm not sure if it's related 
or just a coincidence. It's not just this PG; there's a bunch of them, also on 
different nodes and OSDs.
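
For reference, the timeout value mentioned above was read from the running OSD 
along the lines of (run on the node hosting osd.3):

ceph daemon osd.3 config show | grep -E 'scrub|timeout'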

PG dump for this problematic PG is as follows:

PG_STAT:             14.1
OBJECTS:             10573
MISSING_ON_PRIMARY:  0
DEGRADED:            0
MISPLACED:           0
UNFOUND:             0
BYTES:               0
LOG:                 1579
DISK_LOG:            1579
STATE:               active+clean
STATE_STAMP:         2018-06-14 15:47:32.83
VERSION:             1215'1291261
REPORTED:            1215:7062174
UP:                  [3,8,20]
UP_PRIMARY:          3
ACTING:              [3,8,20]
ACTING_PRIMARY:      3
LAST_SCRUB:          1179'1288697
SCRUB_STAMP:         2018-06-14 10:55:53.326320
LAST_DEEP_SCRUB:     1179'1288697
DEEP_SCRUB_STAMP:    2018-06-14 10:55:53.326320
SNAPTRIMQ_LEN:       0

During the longer running deep-scrub actions we're also running into 
performance problems.

Any idea what's going wrong?

Thanks

Sander