Hello Bill,
  Either 2048 or 4096 should be acceptable.  4096 gives about a 300 PG per
OSD ratio, which would leave room for tripling the OSD count without
needing to increase the PG number.  2048 gives about 150 PGs per OSD,
which leaves room for only about a 50% expansion of the OSD count.
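
For reference, the math behind those ratios, using the 40 OSDs and replica
count of 3 from your setup below:

    PGs per OSD = (pg_num * replicas) / OSD count

    4096 * 3 / 40 = ~307 PGs per OSD
    2048 * 3 / 40 = ~153 PGs per OSD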

The high PG count per OSD issue really doesn't manifest aggressively until
you get to around 1000 PGs per OSD and beyond.  At those levels, steady
state operation continues without issue... but recovery within the cluster
will see the memory utilization of the OSDs climb, and could push the OSD
host into out of memory conditions (or at a minimum, heavy swap usage if
enabled).  Whether you'll actually experience issues still depends, of
course, on the number of OSDs per node and the amount of memory on the
node.

As an example though, I worked on a cluster which had about 5500 PGs per
OSD.  The cluster experienced a network config issue in the switchgear
which isolated two-thirds of the OSD nodes from each other and from the
remaining third of the cluster.  When the network issue was cleared, the
OSDs started dropping like flies... They'd start up, spool up the memory
they needed for map update parsing, and get killed before making any real
headway.  We were finally able to get the cluster online by limiting what
the OSDs were doing to a small slice of the normal start-up, waiting for
the OSDs to calm down, then opening up a bit more for them to do (noup,
noin, norecover, nobackfill, pause, noscrub and nodeep-scrub were all set,
and then unset one at a time until all OSDs were up/in and able to handle
the recovery).
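
For anyone facing a similar recovery, the flag juggling looked roughly
like this (a sketch of the approach rather than an exact transcript; the
order and timing of the unsets depended on how the cluster responded):

    # Limit what the OSDs can do before starting them
    ceph osd set noup
    ceph osd set noin
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd set pause
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Once the OSD processes were stable, unset one flag at a time,
    # waiting for the cluster to settle in between:
    ceph osd unset noup
    ceph osd unset noin
    # ... and so on through the remaining flags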

Six weeks later, that same cluster lost about 40% of its OSDs during a
power outage, due to corruption from an HBA bug (it didn't flush the write
cache to disk).  This pushed the PG per OSD count over 9000!!  It simply
couldn't recover with the available memory at that PG count.  Each OSD,
started by itself, would consume > 60 GB of RAM and get killed (the nodes
only had 64 GB total).

While this is an extreme example, we see support cases involving > 1000
PGs per OSD on a regular basis.  This is the type of thing we're trying to
head off.

It should be noted that you can increase the PG num of a pool, but you
cannot decrease it!  The only way to reduce your cluster's PG count is to
create new pools with smaller PG numbers, migrate the data, and then
delete the old, high PG count pools.  You could also simply add more OSDs
to reduce the PG per OSD ratio.
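
Concretely, that looks something like this (pool names are placeholders,
and rados cppool has caveats, e.g. around snapshots, so test the migration
path on non-critical data first):

    # pg_num can only go up, never down (bump pgp_num along with it):
    ceph osd pool set <pool> pg_num 4096
    ceph osd pool set <pool> pgp_num 4096

    # To shrink, migrate to a new, smaller pool instead:
    ceph osd pool create newpool 2048 2048
    rados cppool oldpool newpool
    ceph osd pool delete oldpool oldpool --yes-i-really-really-mean-it
    ceph osd pool rename newpool oldpool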

The issue with too few PGs is poor data distribution.  So it's all about
having enough PGs to get good data distribution, without going so high
that you hit resource exhaustion during recovery.
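
If you want to sanity check the calculator by hand, the rule of thumb it's
built around is roughly:

    pg_num = (OSD count * target PGs per OSD) / replicas
           = (40 * 100) / 3 = ~1333, rounded up to a power of 2 = 2048

which is where the 2048 figure for your cluster comes from (a target of
~100 PGs per OSD is the commonly cited starting point).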

Hope this helps put things into perspective.

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 4:34 PM, Sanders, Bill <bill.sand...@teradata.com>
wrote:

>  This is interesting.  Kudos to you guys for getting the calculator up, I
> think this'll help some folks.
>
> I have 1 pool, 40 OSDs, and replica of 3.  I based my PG count on:
> http://ceph.com/docs/master/rados/operations/placement-groups/
>
> '''
> Less than 5 OSDs set pg_num to 128
> Between 5 and 10 OSDs set pg_num to 512
> Between 10 and 50 OSDs set pg_num to 4096
> '''
>
> But the calculator gives a different result of 2048.  Out of curiosity,
> what sorts of issues might one encounter by having too many placement
> groups?  I understand there's some resource overhead.  I don't suppose it
> would manifest itself in a recognizable way?
>
> Bill
>
>  ------------------------------
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Michael J. Kidd [michael.k...@inktank.com]
> *Sent:* Wednesday, January 07, 2015 3:51 PM
> *To:* Loic Dachary
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] PG num calculator live on Ceph.com
>
>    > Where is the source ?
>  On the page.. :)  It does link out to jquery and jquery-ui, but all the
> custom bits are embedded in the HTML.
>
>  Glad it's helpful :)
>
>   Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>   - by Red Hat
>
> On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary <l...@dachary.org> wrote:
>
>>
>>
>> On 07/01/2015 23:08, Michael J. Kidd wrote:
>> > Hello all,
>> >   Just a quick heads up that we now have a PG calculator to help
>> determine the proper PG per pool numbers to achieve a target PG per OSD
>> ratio.
>> >
>> > http://ceph.com/pgcalc
>> >
>> > Please check it out!  Happy to answer any questions, and always welcome
>> any feedback on the tool / verbiage, etc...
>>
>> Great work ! That will be immensely useful :-)
>>
>> Where is the source ?
>>
>> Cheers
>>
>> >
>> > As an aside, we're also working to update the documentation to reflect
>> the best practices.  See Ceph.com tracker for this at:
>> > http://tracker.ceph.com/issues/9867
>> >
>> > Thanks!
>> > Michael J. Kidd
>> > Sr. Storage Consultant
>> > Inktank Professional Services
>> >  - by Red Hat
>> >
>> >
>>
>>  --
>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>