By sticking with a power-of-two PG count (1024, 16384, etc.), all of
your PGs will be the same size and therefore easier to balance and manage.
What happens with a non-power-of-two count is something like this: say you
have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to 6, you
end up with 2 PGs that are 2GB and 4 PGs that are 1GB, because 2 of the
original PGs are each split in half to get to the 6 total.  If you increase
pg(p)_num to 8 instead, all 8 PGs will be 1GB.  Depending on how you manage
your cluster that may not matter, but some methods of balancing a cluster
will be thrown badly off by the uneven sizes.
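
To make that concrete, here's a quick back-of-the-envelope sketch in plain
bash (the 4-PG/2GB numbers are just the example above, nothing
cluster-specific):

# Going from OLD to NEW PGs (where OLD < NEW <= 2*OLD), NEW-OLD of the
# original PGs are each split in two; the rest keep their original size.
old=4; new=6; size_gb=2
split=$(( new - old ))        # original PGs that get split in half
kept=$(( old - split ))       # original PGs left untouched
echo "$kept PGs of ${size_gb}GB and $(( split * 2 )) PGs of $(( size_gb / 2 ))GB"
# old=4 new=6 -> 2 PGs of 2GB and 4 PGs of 1GB
# old=4 new=8 -> 0 PGs of 2GB and 8 PGs of 1GB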

This would be a good time to go to a power of two.  I think you're
thinking of Gluster, where if you have 4 bricks and want to increase your
capacity, going to anything other than a multiple of 4 (8, 12, 16)
kills performance (even worse than expanding storage already does) and takes
longer, because it has to awkwardly redistribute the data instead of just
splitting each brick into multiple bricks.

As you increase your PGs, do this slowly and in a loop.  I like to increase
my PGs by 256, wait for all PGs to create, activate, and peer, then
rinse/repeat until I get to my target.  [1] Below is an example of a script
that should accomplish this without interference.  Notice the use of flags
while increasing the PGs: an OSD OOMing or dying for any reason mid-change
adds to the peering that has to happen and makes everything take much
longer.  It would also be wasted IO to start backfilling while you're still
making changes; it's best to wait until you've finished increasing your PGs
and everything has peered before you let data start moving.
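
Before unsetting the flags and letting data move, it's worth a quick sanity
check that the flags really are set and that nothing is still creating,
activating, or peering.  A minimal sketch (assuming a reasonably recent Ceph
CLI; the exact output strings vary a bit between releases):

ceph osd dump | grep ^flags     # should include nodown,nobackfill,norecover
ceph pg stat                    # want something like "N pgs: N active+clean"
ceph health detail | grep -i 'creating\|activating\|peering' \
  || echo "all PGs peered"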

Another thing to keep in mind is how long your cluster will be moving data
around.  Increasing the PG count on a pool full of data is one of the most
intensive operations you can ask a cluster to do.  The last time I had to
do this, I increased pg(p)_num in steps of 4k PGs from 16k to 32k, let it
backfill, then rinsed/repeated until the desired PG count was reached.  For
me, each 4k step took 3-5 days depending on other cluster load and how full
the cluster was.  If you do decide to increase your PGs by 4k at a time
instead of doing the full jump at once, change the 16384 in the script to
the number you decide to go to, let it backfill, and continue (see the
watch loop after the script below for one way to keep an eye on the
backfill).

# Make sure to set the pool variable as well as the number ranges to the
# appropriate values.
pool=data    # set this to the pool whose PGs you're increasing
flags="nodown nobackfill norecover"

for flag in $flags; do
  ceph osd set $flag
done

echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"

# The first number is your current PG count for the pool, the second number
# is the target PG count, and the third number is how many to increase it by
# each time through the loop.  (Note: the loop only lands exactly on the
# target if the start and target differ by a multiple of the step; otherwise
# finish with one final pg_num bump by hand.)
for num in {7700..16384..256}; do
  ceph osd pool set $pool pg_num $num
  while sleep 10; do
    ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
  ceph osd pool set $pool pgp_num $num
  while sleep 10; do
    ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
done

for flag in $flags; do
  ceph osd unset $flag
done
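
Once the flags are unset, backfill kicks off.  Here's a minimal watch loop,
just as a sketch (it greps ceph -s output for the "misplaced"/"backfill"
strings, which may differ slightly between releases):

while sleep 60; do
  # 'misplaced' and 'backfill' show up in ceph -s while data is still moving
  ceph -s | grep -iq 'misplaced\|backfill' || { echo "backfill finished"; break; }
done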

On Thu, May 17, 2018 at 9:27 AM Kai Wagner <> wrote:

> Hi Oliver,
> a good value is 100-150 PGs per OSD. So in your case between 20k and 30k.
> You can increase your PGs, but keep in mind that this will keep the
> cluster quite busy for some while. That said I would rather increase in
> smaller steps than in one large move.
> Kai
> On 17.05.2018 01:29, Oliver Schulz wrote:
> > Dear all,
> >
> > we have a Ceph cluster that has slowly evolved over several
> > years and Ceph versions (started with 18 OSDs and 54 TB
> > in 2013, now about 200 OSDs and 1.5 PB, still the same
> > cluster, with data continuity). So there are some
> > "early sins" in the cluster configuration, left over from
> > the early days.
> >
> > One of these sins is the number of PGs in our CephFS "data"
> > pool, which is 7200 and therefore not (as recommended)
> > a power of two. Pretty much all of our data is in the
> > "data" pool, the only other pools are "rbd" and "metadata",
> > both contain little data (and they have way too many PGs
> > already, another early sin).
> >
> > Is it possible - and safe - to change the number of "data"
> > pool PGs from 7200 to 8192 or 16384? As we recently added
> > more OSDs, I guess it would be time to increase the number
> > of PGs anyhow. Or would we have to go to 14400 instead of
> > 16384?
> >
> >
> > Thanks for any advice,
> >
> > Oliver
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB
> 21284 (AG Nürnberg)