Excellent, thanks for the detailed breakdown.

Take care,
Bill
________________________________
From: Michael J. Kidd [michael.k...@inktank.com]
Sent: Wednesday, January 07, 2015 4:50 PM
To: Sanders, Bill
Cc: Loic Dachary; ceph-us...@ceph.com
Subject: Re: [ceph-users] PG num calculator live on Ceph.com

Hello Bill,
  Either 2048 or 4096 should be acceptable.  4096 gives about 300 PGs per OSD,
which would leave room for tripling the OSD count without needing to increase
the PG number, while 2048 gives about 150 PGs per OSD, leaving room for only
about a 50% OSD count expansion.
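
For reference, here is a quick Python sketch of the arithmetic behind those
ratios (single pool only; the per-pool %data weighting that pgcalc applies is
left out):

def pgs_per_osd(pg_num, replicas, osds):
    # Each PG is stored on 'replicas' OSDs, so the per-OSD load is
    # pg_num * replicas / osds.
    return pg_num * replicas / osds

for pg_num in (2048, 4096):
    ratio = pgs_per_osd(pg_num, replicas=3, osds=40)
    print("pg_num=%d: ~%d PGs per OSD" % (pg_num, round(ratio)))

# Prints:
# pg_num=2048: ~154 PGs per OSD
# pg_num=4096: ~307 PGs per OSD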

The high PG count per OSD issue really doesn't manifest aggressively until you
get to around 1000 PGs per OSD and beyond.  At those levels, steady-state
operation continues without issue, but recovery within the cluster will see
the memory utilization of the OSDs climb and could push the OSD host into
out-of-memory conditions (or, at a minimum, heavy swap usage if enabled).
Whether you'll actually experience issues still depends, of course, on the
number of OSDs per node and the amount of memory on each node.

As an example, though, I worked on a cluster which was running about 5500 PGs
per OSD.  The cluster experienced a network config issue in the switchgear
which isolated two-thirds of the OSD nodes from each other and from the other
third of the cluster.  When the network issue was cleared, the OSDs started
dropping like flies... they'd start up, spool up the memory they needed for
map update parsing, and get killed before making any real headway.  We were
finally able to get the cluster online by limiting what the OSDs were doing to
a small slice of the normal start-up, waiting for the OSDs to calm down, then
opening up a bit more for them to do (noup, noin, norecover, nobackfill,
pause, noscrub, and nodeep-scrub were all set, and then unset one at a time
until all OSDs were up/in and able to handle the recovery).
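
For illustration, here is a rough Python sketch of that staged approach,
driving the standard 'ceph osd set' / 'ceph osd unset' commands.  The wait
helper and the unset order below are placeholders, not a prescription:

import subprocess
import time

# Cluster flags mentioned above.
FLAGS = ["noup", "noin", "norecover", "nobackfill",
         "pause", "noscrub", "nodeep-scrub"]

def ceph(*args):
    # Thin wrapper around the ceph CLI.
    subprocess.run(["ceph", *args], check=True)

def wait_for_osds_to_settle(seconds=300):
    # Placeholder: in practice you'd watch 'ceph -s' and OSD memory usage
    # and only continue once the daemons have calmed down.
    time.sleep(seconds)

# 1. Set every flag so freshly started OSDs do as little work as possible.
for flag in FLAGS:
    ceph("osd", "set", flag)

# (Start or restart the OSD daemons here and let them settle.)

# 2. Unset the flags one at a time, letting the cluster settle between
#    steps, until all OSDs are up/in and recovery can proceed.  This order
#    is illustrative.
for flag in ["noup", "noin", "pause", "norecover", "nobackfill",
             "noscrub", "nodeep-scrub"]:
    ceph("osd", "unset", flag)
    wait_for_osds_to_settle()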

Six weeks later, that same cluster lost about 40% of its OSDs during a power
outage, due to corruption from an HBA bug (it didn't flush the write cache to
disk).  This pushed the PG per OSD count over 9000!  It simply couldn't
recover with the available memory at that PG count.  Each OSD, started by
itself, would consume more than 60 GB of RAM and get killed (the nodes only
had 64 GB total).

While this is an extreme example, we see cases with more than 1000 PGs per OSD
on a regular basis.  This is the type of thing we're trying to head off.

It should be noted that you can increase the PG num of a pool, but cannot
decrease it!  The only way to reduce your cluster's PG count is to create new
pools with smaller PG nums, migrate the data, and then delete the old,
high-PG-count pools.  You could also simply add more OSDs to reduce the PG per
OSD ratio.
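
For reference, the increase itself is just the standard pool-set commands; a
minimal sketch from Python (the pool name 'rbd' and target of 4096 below are
placeholders):

import subprocess

# pgp_num must be raised along with pg_num, since it is what actually
# triggers the data movement.
for setting in ("pg_num", "pgp_num"):
    subprocess.run(["ceph", "osd", "pool", "set", "rbd", setting, "4096"],
                   check=True)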

The issue with too few PGs is poor data distribution.  So it's all about having 
enough PGs to get good data distribution without going too high and having 
resource exhaustion during recovery.

Hope this helps put things into perspective.

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 4:34 PM, Sanders, Bill
<bill.sand...@teradata.com> wrote:
This is interesting.  Kudos to you guys for getting the calculator up, I think 
this'll help some folks.

I have 1 pool, 40 OSDs, and a replica count of 3.  I based my PG count on:
http://ceph.com/docs/master/rados/operations/placement-groups/

'''
Less than 5 OSDs set pg_num to 128
Between 5 and 10 OSDs set pg_num to 512
Between 10 and 50 OSDs set pg_num to 4096
'''

But the calculator gives a different result: 2048.  Out of curiosity, what
sorts of issues might one encounter by having too many placement groups?  I 
understand there's some resource overhead.  I don't suppose it would manifest 
itself in a recognizable way?

Bill

________________________________
From: ceph-users [ceph-users-boun...@lists.ceph.com]
on behalf of Michael J. Kidd [michael.k...@inktank.com]
Sent: Wednesday, January 07, 2015 3:51 PM
To: Loic Dachary
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] PG num calculator live on Ceph.com

> Where is the source ?
On the page :)  It does link out to jQuery and jQuery UI, but all the custom
bits are embedded in the HTML.

Glad it's helpful :)

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary
<l...@dachary.org> wrote:


On 07/01/2015 23:08, Michael J. Kidd wrote:
> Hello all,
>   Just a quick heads up that we now have a PG calculator to help determine 
> the proper PG per pool numbers to achieve a target PG per OSD ratio.
>
> http://ceph.com/pgcalc
>
> Please check it out!  Happy to answer any questions, and always welcome any 
> feedback on the tool / verbiage, etc...

Great work ! That will be immensely useful :-)

Where is the source ?

Cheers

>
> As an aside, we're also working to update the documentation to reflect the 
> best practices.  See Ceph.com tracker for this at:
> http://tracker.ceph.com/issues/9867
>
> Thanks!
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Loïc Dachary, Artisan Logiciel Libre



