Great summary, David. Wouldn't this be worth a blog post?
On 17.05.2018 20:36, David Turner wrote:
> By sticking with a power of two for your PG count (1024, 16384, etc.),
> all of your PGs will be the same size and easier to balance and manage.
> What happens when you have a non-power-of-two count is something like
> this: say you have 4 PGs that are each 2GB in size. If you increase
> pg(p)_num to 6, you will have 2 PGs that are 2GB and 4 PGs that are
> 1GB, because 2 of the original PGs were split in half to reach the 6
> total. If you increase pg(p)_num to 8, all 8 PGs will be 1GB.
> Depending on how you manage your cluster, that may not matter, but for
> some methods of balancing a cluster it will greatly imbalance things.
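> To make that split arithmetic concrete, here is a toy shell sketch
> (plain arithmetic only, no Ceph commands; the 4 x 2GB starting point is
> the example from the previous paragraph):
>
> # Start with 4 PGs of 2GB each and see what 6 or 8 PGs looks like.
> start_pgs=4; pg_gb=2
> for target in 6 8; do
>     split=$((target - start_pgs))      # original PGs that get split in two
>     whole=$((start_pgs - split))       # original PGs left at full size
>     echo "pg_num=$target -> $whole PGs of ${pg_gb}GB and $((2 * split)) PGs of $((pg_gb / 2))GB"
> done
>
> (This simple model only holds while the new pg_num is at most double the
> old one; beyond that, PGs split more than once.)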
>
> This would be a good time to go to a power of two. I think you're
> thinking of Gluster, where if you have 4 bricks and you want to
> increase your capacity, going to anything other than a multiple of 4
> (8, 12, 16) kills performance (even worse than expanding storage
> already does) and takes longer, because it has to divide the data
> awkwardly instead of splitting each existing brick across multiple new
> bricks.
>
> As you increase your PGs, do it slowly and in a loop. I like to
> increase my PGs by 256 at a time, wait for all PGs to create, activate,
> and peer, then rinse/repeat until I reach my target. [1] is an example
> of a script that should accomplish this without interference. Notice
> the use of flags while increasing the PGs: if an OSD OOMs or dies for
> any reason mid-way, the extra peering that causes will make everything
> take much longer, and the flags avoid that churn. It is also wasted IO
> to start backfilling while you're still making changes; it's best to
> wait until you've finished increasing your PGs and everything has
> peered before you let data start moving.
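> While the loop runs, these standard commands give a quick view of
> whether anything is still creating or peering (shown here just as a
> convenience; the script in [1] does its own checking):
>
> ceph -s             # overall cluster status
> ceph health detail  # lists PGs stuck peering/activating, if any
> ceph pg stat        # one-line summary of PG states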
>
> Another thing to keep in mind is how long your cluster will be moving
> data around. Increasing the PG count on a pool full of data is one of
> the most intensive operations you can ask a cluster to do. The last
> time I had to do this, I increased pg(p)_num by 4k PGs at a time from
> 16k to 32k, let it backfill, and rinsed/repeated until the desired PG
> count was reached. For me, each 4k increase took 3-5 days depending on
> other cluster load and how full the cluster was. If you decide to
> increase your PGs by 4k at a time instead of doing the full increase in
> one go, change the 16384 in the script to the intermediate target you
> choose, let it backfill, and continue.
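>
> A rough sketch of that "4k at a time, backfill in between" approach
> (the pool name, step values, and sleep intervals below are placeholders
> for your own numbers; unlike the script in [1], this one lets backfill
> finish between steps instead of deferring it with flags):
>
> pool=data
> for num in 20480 24576 28672 32768; do
>     ceph osd pool set $pool pg_num $num
>     # wait for the new PGs to be created and peered
>     while ceph health | grep -q 'creating\|peering\|activating\|inactive'; do
>         sleep 10
>     done
>     ceph osd pool set $pool pgp_num $num
>     # let the backfill from this step finish before taking the next one
>     while ceph health | grep -q 'backfill\|recover\|degraded\|misplaced'; do
>         sleep 60
>     done
> done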
>
>
> [1]
> # Make sure to set the pool variable as well as the number ranges below
> # to the appropriate values.
> flags="nodown nobackfill norecover"
> for flag in $flags; do
>     ceph osd set $flag
> done
> pool=rbd
> echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
> # In the range below, the first number is your current PG count for the
> # pool, the second number is the target PG count, and the third number
> # is how much to increase it by each time through the loop. Note that
> # bash stops at the last value not exceeding the target, so pick a first
> # number that reaches the target exactly (or set pg_num/pgp_num to the
> # target once more after the loop).
> for num in {7700..16384..256}; do
>     ceph osd pool set $pool pg_num $num
>     # wait until nothing is left creating/peering before touching pgp_num
>     while sleep 10; do
>         ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>     done
>     ceph osd pool set $pool pgp_num $num
>     while sleep 10; do
>         ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>     done
> done
> for flag in $flags; do
>     ceph osd unset $flag
> done
>
> On Thu, May 17, 2018 at 9:27 AM Kai Wagner <[email protected]> wrote:
>
> Hi Oliver,
>
> a good value is 100-150 PGs per OSD. So in your case between 20k
> and 30k.
>
> You can increase your PGs, but keep in mind that this will keep the
> cluster quite busy for a while. That said, I would rather increase in
> smaller steps than in one large move.
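>
> As a rough back-of-the-envelope check against the 100-150 guideline
> (the pg_num of 8192 and replica size of 3 below are assumptions; plug
> in your pool's actual size and OSD count):
>
> pg_num=8192; size=3; osds=200
> echo "~$(( pg_num * size / osds )) PG replicas per OSD"   # comes out around 122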
>
> Kai
>
>
> On 17.05.2018 01:29, Oliver Schulz wrote:
> > Dear all,
> >
> > we have a Ceph cluster that has slowly evolved over several
> > years and Ceph versions (started with 18 OSDs and 54 TB
> > in 2013, now about 200 OSDs and 1.5 PB, still the same
> > cluster, with data continuity). So there are some
> > "early sins" in the cluster configuration, left over from
> > the early days.
> >
> > One of these sins is the number of PGs in our CephFS "data"
> > pool, which is 7200 and therefore not (as recommended)
> > a power of two. Pretty much all of our data is in the
> > "data" pool, the only other pools are "rbd" and "metadata",
> > both contain little data (and they have way too many PGs
> > already, another early sin).
> >
> > Is it possible - and safe - to change the number of "data"
> > pool PGs from 7200 to 8192 or 16384? As we recently added
> > more OSDs, I guess it would be time to increase the number
> > of PGs anyhow. Or would we have to go to 14400 instead of
> > 16384?
> >
> >
> > Thanks for any advice,
> >
> > Oliver
> >
>
>
>
>
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284
(AG Nürnberg)
