First I'll address increasing your PG counts, since that is what you 
specifically asked about. However, I do not believe that is your problem, and 
I'll explain why further down.

There are a few recent threads on the ML about increasing the pg_num and 
pgp_num on a cluster.  But if you learn how to search the archives, let me 
know... I always get an error.  The gist is to set nobackfill, norecover, 
noout, and nodown on your cluster, then increase your pg_num and then pgp_num 
in small increments, waiting for all peering, creating, inactive, etc. pgs to 
clear before doing the next set of pgs.  People generally do 256 at a time, but 
we're seeing poor performance with that number, and since we have it mostly 
automated we're starting to increment by 64 to mitigate the cluster impact.
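
To make the sequence above concrete, here is a rough dry-run sketch.  It only 
prints the commands and runs nothing against the cluster.  The helper name 
plan_pg_increase and the example numbers are made up for illustration, and 
unsetting the flags at the end is my assumption about how you would finish; 
check each step against your own cluster before running anything.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for raising pg_num/pgp_num in steps.
# plan_pg_increase is a hypothetical helper, not a Ceph tool.
plan_pg_increase() {
    pool=$1 current=$2 target=$3 step=$4
    for flag in nobackfill norecover noout nodown; do
        echo "ceph osd set $flag"
    done
    while [ "$current" -lt "$target" ]; do
        current=$((current + step))
        # Never overshoot the requested target.
        [ "$current" -gt "$target" ] && current=$target
        echo "ceph osd pool set $pool pg_num $current"
        echo "ceph osd pool set $pool pgp_num $current"
        echo "# wait until no PGs are peering, creating, or inactive"
    done
    # Assumption: the flags are cleared once the target is reached.
    for flag in nobackfill norecover noout nodown; do
        echo "ceph osd unset $flag"
    done
}

# Example: one 256-PG bump on a pool, done as four steps of 64.
plan_pg_increase volumes 2048 2304 64
```

Printing the plan first lets you review the step size before committing to it; 
the wait between steps is left as a comment because how you check PG states 
depends on your tooling.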

What percentage of your data is in each of your pools?  Based on your number of 
PGs, you should have 2/3 of your data in Volumes and 1/3 in Images.  If that 
is correct and will continue to be true, then you want to keep that ratio 
similar.  Let me know if you have any questions on this.

This is where I think your slow and blocked requests come from.  Every time we 
have persistent, but seemingly random, slow/blocked requests, the cause is 
always PG sub-folders splitting.  The threshold for this is calculated from a 
constant and 2 settings that you can set in your config (filestore merge 
threshold, filestore split multiple).  "filestore split multiple" is not a 
value the cluster uses directly; it is a variable used to derive the value the 
cluster does use (the equation is shown below).  "filestore merge threshold" is 
how many objects can be in subfolders before they are merged back together into 
1 directory; this is a sum of all objects in the subfolders.  If you set this 
to a negative number, then subfolders will never be merged, but the value is 
still used in the equation with "filestore split multiple" (notice the abs in 
the equation, which ignores the sign).  The equation for how many objects you 
can have in a folder before it splits into sub-folders is

 objects per folder before split
     = 16 * { filestore split multiple } * abs( { filestore merge threshold } )
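
As a quick way to plug your own values into that equation, here is a small 
sketch.  The function name split_threshold and the sample values are only 
illustrative; they are not recommendations, so substitute whatever is in your 
own config.

```shell
#!/bin/sh
# split_threshold is a hypothetical helper name.
# Usage: split_threshold <filestore split multiple> <filestore merge threshold>
split_threshold() {
    split_multiple=$1
    merge_threshold=$2
    # abs(): strip a leading minus sign, since a negative merge threshold
    # only disables merging and its magnitude is still used.
    echo $((16 * split_multiple * ${merge_threshold#-}))
}

split_threshold 2 10     # prints 320
split_threshold 8 -40    # prints 5120
```

Comparing the result with the per-folder object counts on an osd tells you how 
close a PG is to its next split.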

These settings cannot be injected; you must change your cluster config and 
restart your osds to change them.  The way you can tell if this is happening 
on your cluster is to check what your values are, plug them into the equation, 
and then check a pg in your cluster with a command similar to this to see if 
you are in the middle of splitting sub-folders or have recently split:

cd /var/lib/ceph/osd/ceph-$osd/current/
for folder in *_head; do
    echo "$folder"
    ls -1R "$folder" | cut -d. -f1 | uniq -c | grep -Ev '^\s+1 '
done

That assumes you go into a valid osd current folder and will give you a count 
of all objects inside of your sub-folders for each PG.  If you are in the 
middle of splitting sub-folders, then you will see that the smallest numbers 
are about 1/16 of the largest numbers.  That would be because they are dividing 
all of the objects into 16 subfolders.

When our clusters have this happen, we don't only see slow/blocked requests; 
we also see osds being marked down for a bit.  We combat this by injecting 
'--osd_heartbeat_grace' with a value high enough to allow the osd to finish 
splitting its sub-folders before it continues to respond to requests.  This 
value is how long an osd will wait for a response from another osd before 
telling the mons that it's not responding.  We used to use 180 (3 minutes), 
but that is no longer high enough and we're now using 240 when we see that a 
cluster is splitting sub-folders and errantly marking osds down.
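
For reference, the injection described above would look roughly like the 
command this sketch prints.  The helper grace_cmd is a made-up name that only 
prints the command line so you can review it first, and I'm assuming here that 
you inject it with 'ceph tell osd.* injectargs'; verify that against your 
release before running it.

```shell
#!/bin/sh
# grace_cmd is a hypothetical helper; it prints the command and runs nothing.
grace_cmd() {
    printf "ceph tell 'osd.*' injectargs '--osd_heartbeat_grace=%s'\n" "$1"
}

grace_cmd 240
```

An injected value is not written to the config file, so put it in your config 
as well if you want it to persist across osd restarts.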


David Turner | Cloud Operations Engineer | StorageCraft Technology 
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943




From: ceph-users on behalf of Emilio Moreno Fernandez
Sent: Tuesday, October 11, 2016 5:29 AM
Subject: [ceph-users] Modify placement group pg and pgp in production 


We have a production Ceph platform in our OpenStack farm. The platform has the 
following specs:

1 Admin Node
3 Monitors
7 Ceph Nodes with 160 OSDs on 1.2TB 10K SAS HDDs. Maybe 30 OSDs have an SSD 
journal...we are in the process of upgrading...;-)

All networks have 10GB Ethernet links. We now have some problems with slow and 
blocked requests, and we are diagnosing the platform.

One of our problems is the placement groups; by mistake, this number has not 
been changed for a long time...our pools:

    SIZE  AVAIL   RAW USED  %RAW USED  OBJECTS
    173T  56319G  118T      68.36      10945k

    NAME      ID  CATEGORY  USED    %USED  MAX AVAIL  OBJECTS  DIRTY  READ    WRITE
    rbd       0   -         0       0      14992G     0        0      1       66120
    volumes   6   -         42281G  23.75  14992G     8871636  8663k  47690M  55474M
    images    7   -         18151G  10.20  14992G     2324108  2269k  1456M   1622k
    backups   8   -         0       0      14992G     1        1      18578   104k
    vms       9   -         91575M  0.05   14992G     12827    12827  2526k   6863k

And the PG counts of our pools are (used pools only):

Volumes   2048
Images    1024

After verifying network, servers, hardware, disks, software, bugs, and logs, we 
think that our performance problem is the number of PGs in volumes...

Our Question:

How can we update the pg number, and afterwards the pgp number, in a production 
environment without interrupting service, poor performance or down the virtual 
The last update was made from 512 to 1024 in the images pool, and service 
dropped for 2 hours because the platform could not support the data 
traffic....we are scared :-(
Can we make this change in small increments over two weeks? How?

Thanks Thanks Thanks


Emilio Moreno Fernández
