We regrettably have to increase PGs in a Ceph cluster this way more often than
anyone should ever need to, so we have scripted it. A basic version of the
script that should work for you is below.
First, create a function that checks for any PG states you don't want to
proceed in (better than duplicating the check). Second, set the flags so your
cluster doesn't die while you do this. Third, set your current PG count and
your destination PG count in the for loop. The loop skips any number not
divisible by 256. As you've found, 256 is a good increment; more than that and
you'll run into issues of your cluster curling into a fetal position and
crying. Each pass of the loop increases pg_num, waits until everything has
settled, then increases pgp_num. The seemingly excessive sleeps give the
cluster time to resolve the blocked requests that will still happen during
this. Lastly, unset the flags to let the cluster start moving the data around.
One thing to note: in a cluster with 800-1000 HDD OSDs with SSD journals, going
from 16k to 32k PGs, we set osd_max_backfills to 1 during busy times and 2
during idle times. Setting it higher than 2 does not help us increase our PG
count any faster. We have tested osd_max_backfills of 2 and 5; both took the
entire weekend to add 4k PGs. We also do not add all of the PGs at once. We do
4k each weekend and 2k during the week, waiting for the cluster to finish each
time to give our mon stores a chance to compact before we continue.
check_health(){
    # Returns 0 if any of the strings in the grep are found; otherwise returns
    # 1 (or whatever the grep return code is).
    ceph health | grep 'peering\|stale\|activating\|creating\|down' > /dev/null
    return $?
}
for flag in nobackfill norecover noout nodown
do
    ceph osd set "$flag"
done
# Set your current and destination PG counts here.
for num in {2048..16384}
do
    # Skip anything that is not a multiple of 256.
    [ $(( num % 256 )) -eq 0 ] || continue
    # Wait until no PGs are in an unwanted state, then raise pg_num.
    while sleep 10
    do
        check_health
        if [ $? -ne 0 ]
        then
            # This assumes your pool is named rbd
            ceph osd pool set rbd pg_num "$num"
            break
        fi
    done
    sleep 60
    # Wait for the new PGs to settle, then raise pgp_num to match.
    while sleep 10
    do
        check_health
        if [ $? -ne 0 ]
        then
            # This assumes your pool is named rbd
            ceph osd pool set rbd pgp_num "$num"
            break
        fi
    done
    sleep 60
done
for flag in nobackfill norecover noout nodown
do
    ceph osd unset "$flag"
done
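
The two wait loops in the script above are identical apart from the setting
they change, so they could be folded into one helper. The following is only a
sketch of that idea, not part of the original script: wait_until_settled and
POLL_INTERVAL are names made up here, and the pool name rbd is the same
assumption the script already makes.

```shell
#!/usr/bin/env bash
# Sketch only: wait_until_settled and POLL_INTERVAL are illustrative names.
POLL_INTERVAL=${POLL_INTERVAL:-10}

# Returns 0 while any PG is in a state we do not want to proceed in.
check_health() {
    ceph health | grep -q 'peering\|stale\|activating\|creating\|down'
}

# Sleep, check, and repeat until check_health no longer matches.
wait_until_settled() {
    while sleep "$POLL_INTERVAL"
    do
        check_health || return 0
    done
}

# Possible use inside the main loop (pool name rbd assumed):
#   wait_until_settled && ceph osd pool set rbd pg_num "$num"
#   sleep 60
#   wait_until_settled && ceph osd pool set rbd pgp_num "$num"
```

As in the original, a failing ceph health call produces no output for grep to
match and is therefore treated as "settled", so you may want a stricter check
in production.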
________________________________
David Turner | Cloud Operations Engineer | StorageCraft Technology
Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943
________________________________
If you are not the intended recipient of this message or received it
erroneously, please notify the sender and delete it, together with any
attachments, and be advised that any dissemination or copying of this message
is prohibited.
________________________________
From: ceph-users [[email protected]] on behalf of Matteo
Dacrema [[email protected]]
Sent: Monday, September 19, 2016 2:51 AM
To: Will.Boege; [email protected]
Subject: Re: [ceph-users] [EXTERNAL] Re: Increase PG number
Hi,
I have 3 different clusters.
On the first, I was able to go from 1024 to 2048 PGs with 10 minutes of
"io freeze".
On the second, I was able to go from 368 to 512 in a second without any
performance issue, but going from 512 to 1024 took over 20 minutes to create
the PGs.
The third, which I still have to upgrade, is now at 2048 PGs and I have to take
it to 16384. So what I'm wondering is how to do it with minimum performance
impact.
Maybe the best way is to increase pg_num and pgp_num by 256 at a time, letting
the cluster rebalance each time.
Thanks
Matteo
This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed. If
you have received this email in error please notify the system manager. This
message contains confidential information and is intended only for the
individual named. If you are not the named addressee you should not
disseminate, distribute or copy this e-mail. Please notify the sender
immediately by e-mail if you have received this e-mail by mistake and delete
this e-mail from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.
On 19 Sep 2016, at 05:22, Will.Boege <[email protected]> wrote:
How many PGs do you have, and how many are you increasing it to?
Increasing PG counts can be disruptive if you are increasing by a large
proportion of the initial count, because of all the PG peering involved. If you
are doubling the number of PGs it might be good to do it in stages to minimize
peering. For example, if you are going from 1024 to 2048, consider 4 increases
of 256, allowing the cluster to stabilize in between, rather than one event
that doubles the number of PGs.
If you expect this cluster to grow, overshoot the recommended PG count by 50%
or so. This will allow you to minimize the number of PG increase events, and
thus the impact on your users.
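
To make the staging concrete, here is a small sketch that prints the
intermediate pg_num targets for a given start, end, and step. pg_steps is a
helper name invented for this example, not a Ceph command, and it only prints
numbers; it does not touch the cluster.

```shell
#!/usr/bin/env bash
# Sketch only: pg_steps is an illustrative helper, not a Ceph command.
# Prints each pg_num target from current+step up to target, one per line.
pg_steps() {
    local current=$1 target=$2 step=${3:-256}
    # Nothing to do if we are already at or above the target.
    [ "$current" -lt "$target" ] || return 0
    local next=$(( current + step ))
    while [ "$next" -lt "$target" ]
    do
        echo "$next"
        next=$(( next + step ))
    done
    # Always finish exactly on the target, even if it is not a multiple of step.
    echo "$target"
}

# The 1024 to 2048 example: prints 1280, 1536, 1792, 2048 (four increases).
pg_steps 1024 2048 256
```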
From: ceph-users <[email protected]> on behalf of Matteo Dacrema
<[email protected]>
Date: Sunday, September 18, 2016 at 3:29 PM
To: Goncalo Borges <[email protected]>, "[email protected]"
<[email protected]>
Subject: [EXTERNAL] Re: [ceph-users] Increase PG number
Hi, thanks for your reply.
Yes, I don't have any near-full OSDs.
The problem is not the rebalancing process but the creation of the new PGs.
I have only 2 hosts running Ceph Firefly, each with 3 SSDs for journaling.
During the creation of the new PGs, all the attached volumes stop reading and
writing and show high iowait.
ceph -s tells me that there are thousands of slow requests.
Once all the PGs are created, the slow requests begin to decrease and the
cluster starts the rebalancing process.
Matteo
On 18 Sep 2016, at 13:08, Goncalo Borges <[email protected]> wrote:
Hi
I am assuming that you do not have any near-full OSDs (either before or during
the PG splitting process) and that your cluster is healthy.
To minimize the impact on clients during recovery or operations like PG
splitting, it is good to set the following configs. Obviously the whole
operation will take longer to complete, but the impact on clients will be
minimized.
# ceph daemon mon.rccephmon1 config show | egrep "(osd_max_backfills|osd_recovery_threads|osd_recovery_op_priority|osd_client_op_priority|osd_recovery_max_active)"
"osd_max_backfills": "1",
"osd_recovery_threads": "1",
"osd_recovery_max_active": "1"
"osd_client_op_priority": "63",
"osd_recovery_op_priority": "1"
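
One way to apply throttles like these to running OSDs without a restart is
ceph tell with injectargs. The wrapper below is only a sketch: the names
throttle_recovery and DRY_RUN are made up for this example, and it assumes a
release where "ceph tell osd.* injectargs" is available. It leaves
osd_client_op_priority and osd_recovery_threads alone. Values injected this way
do not survive an OSD restart, so also put them in ceph.conf if you want them
to persist.

```shell
#!/usr/bin/env bash
# Sketch only: throttle_recovery and DRY_RUN are illustrative names.
# With DRY_RUN=1 the command is printed instead of executed.
throttle_recovery() {
    local backfills=${1:-1}
    local args="--osd-max-backfills ${backfills} --osd-recovery-max-active 1 --osd-recovery-op-priority 1"
    if [ "${DRY_RUN:-0}" = "1" ]
    then
        echo "ceph tell osd.* injectargs '${args}'"
    else
        ceph tell 'osd.*' injectargs "${args}"
    fi
}

# Show what would be run for osd_max_backfills=1 without touching the cluster.
DRY_RUN=1
throttle_recovery 1
```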
Cheers
G.
________________________________________
From: ceph-users [[email protected]] on behalf of Matteo Dacrema
[[email protected]]
Sent: 18 September 2016 03:42
To: [email protected]
Subject: [ceph-users] Increase PG number
Hi All,
I need to expand my Ceph cluster and I also need to increase the PG number.
In a test environment I see that during PG creation all read and write
operations are stopped.
Is that normal behavior?
Thanks
Matteo
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com