Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

Thomas Bennett Thu, 25 Jan 2018 23:46:55 -0800

Hi Peter,

Not sure if you have got to the bottom of your problem, but I seem to have
found what might be a similar one. I recommend reading on, as there could
be a hidden problem.


Yesterday our cluster went into *HEALTH_WARN* state and I noticed that one
of my pgs was listed as '*activating*' and marked as '*inactive*' and
'*unclean*'.

We also have a mixed OSD system - 768 HDDs and 16 NVMEs - with three crush
rules for object placement: the default *replicated_rule* (I never deleted
it) and two new ones, *replicated_rule_hdd* and *replicated_rule_nvme*.

Running a query on the pg (in my case pg 15.792) did not yield anything out
of place, except for it telling me that its state was '*activating*'
(which I could not even find listed as a pg state: pg states
<http://docs.ceph.com/docs/master/rados/operations/pg-states/>), and that
made me slightly alarmed.

The bits of information that alerted me to the issue were:

1. Running 'ceph pg dump' and finding the 'activating' pg showed the
following information:

15.792 activating [4,724,242] #for pool 15 pg there are osds 4,724,242


2. Running 'ceph osd tree | grep "osd.4 "' and getting the following
information:

4 nvme osd.4

3. Now checking what pool 15 is by running 'ceph osd pool ls detail':

pool 15 'default.rgw.data' replicated size 3 min_size 2 crush_rule 1


These three bits of information made me realise what was going on:

   - OSDs 4, 724 and 242 are all nvmes
   - Pool 15 should obey crush_rule 1 (*replicated_rule_hdd*)
   - Pool 15 has pgs that use nvmes!
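
The check behind those three observations can be sketched in a few lines of
Python. This is only an illustration: the dictionaries below are filled in by
hand from the outputs quoted above, and the function name and data layout are
assumptions of the sketch, not anything Ceph provides.

```python
from typing import Dict, List

# Values copied by hand from the 'ceph pg dump' and 'ceph osd tree' output
# quoted above; the dictionary layout is an assumption made for this sketch.
PG_UP_SETS: Dict[str, List[int]] = {"15.792": [4, 724, 242]}
OSD_CLASS: Dict[int, str] = {4: "nvme", 724: "nvme", 242: "nvme"}


def misplaced_osds(pgid: str, expected_class: str,
                   pg_up_sets: Dict[str, List[int]] = PG_UP_SETS,
                   osd_class: Dict[int, str] = OSD_CLASS) -> List[int]:
    """Return the OSDs of `pgid` whose device class is not `expected_class`."""
    return [osd for osd in pg_up_sets[pgid]
            if osd_class.get(osd) != expected_class]


if __name__ == "__main__":
    # Pool 15 is supposed to live on hdd only, so any OSD listed here is
    # one the hdd rule should never have selected.
    print(misplaced_osds("15.792", "hdd"))
```

An empty list would mean the pg sits entirely on the expected device class.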

I found the following really useful tool online which showed me the depth
of the problem: Get the Number of Placement Groups Per Osd
<http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd>

So it turns out in my case pool 15 has pgs on all the nvmes!
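
The idea behind that tool, a per-OSD tally of pgs for one pool, can be
sketched as follows. This is not the linked script; the function name is
hypothetical, and apart from pg 15.792 the sample entries are made up purely
for illustration.

```python
from collections import Counter
from typing import Dict, List


def pgs_per_osd(pg_up_sets: Dict[str, List[int]], pool_id: int) -> Counter:
    """Count how many pgs of `pool_id` are mapped to each OSD."""
    prefix = f"{pool_id}."          # pg ids look like '<pool>.<hex>'
    counts: Counter = Counter()
    for pgid, osds in pg_up_sets.items():
        if pgid.startswith(prefix):
            counts.update(osds)     # one hit per OSD in the pg's set
    return counts


if __name__ == "__main__":
    sample = {
        "15.792": [4, 724, 242],    # from the pg dump quoted in this mail
        "15.1a": [4, 10, 11],       # hypothetical pg, same pool
        "16.2": [4, 5, 6],          # hypothetical pg, different pool
    }
    for osd, n in sorted(pgs_per_osd(sample, 15).items()):
        print(f"osd.{osd}: {n}")
```

Cross-referencing the OSDs that show a non-zero count with their device
class would show how far a pool has spread onto the wrong devices.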

To test a fix, and to be able to mimic the problem again, I executed the
following command: 'ceph osd pg-upmap-items 15.792 4 22 724 67 76 242'
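
As a rough aid for reading that command: the arguments after the pg id of
pg-upmap-items are taken as from/to OSD pairs. The sketch below splits the
argument list that way and applies the pairs to the pg's OSD set. Both helper
names are hypothetical, and the sketch only models the pair substitution, not
any validation Ceph itself performs.

```python
from typing import List, Tuple


def parse_upmap_items(argv: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Split '<pgid> <from> <to> [<from> <to> ...]' into (pgid, pairs)."""
    pgid, *rest = argv
    if not rest or len(rest) % 2:
        raise ValueError("expected one or more <from> <to> OSD pairs")
    ids = [int(x) for x in rest]
    return pgid, list(zip(ids[0::2], ids[1::2]))


def apply_upmap(up_set: List[int], pairs: List[Tuple[int, int]]) -> List[int]:
    """Replace each 'from' OSD found in the set with its 'to' OSD."""
    mapping = dict(pairs)
    return [mapping.get(osd, osd) for osd in up_set]


if __name__ == "__main__":
    pgid, pairs = parse_upmap_items("15.792 4 22 724 67 76 242".split())
    print(pgid, pairs)
    print(apply_upmap([4, 724, 242], pairs))
```

Read this way, the last pair as written is 76 -> 242 rather than 242 -> 76,
which may simply be a transcription slip in this mail, so the pair order is
worth checking before reusing the command.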

It remapped the osds used by the 'activating' pg, my cluster status went
back to *HEALTH_OK* and the pg went back to normal, making the cluster
appear healthy.

Luckily for me we've not put the cluster into production so I'll just blow
away the pool and recreate it.

What I've not yet figured out is how this happened.

The steps (I think) I took were:

   1. Run ceph-ansible; the 'default.rgw.data' pool was created
   automatically.
   2. I think I then increased the pg count.
   3. Create a new rule: ceph osd crush rule create-replicated
   replicated_rule_hdd default host hdd
   4. Move pool to new rule: ceph osd pool set default.rgw.data crush_rule
   replicated_rule_hdd

I don't know what the expected behaviour of the set command is, so I'm
planning to see if I can recreate the problem on a test cluster to see which
part of the process created the problem. Perhaps I should have first
migrated to the new rule before increasing the pgs.

Regards,
Tom

On Sat, Jan 20, 2018 at 10:30 PM, <[email protected]> wrote:

> Hi all,
>
> I'm getting such weird problems when we for instance re-add a server, add
> disks etc! Most of the time some PGs end up in "active+clean+remapped"
> mode, but today some of them got stuck "activating" which meant that some
> PGs were offline for a while. I'm able to fix things, but the fix is so
> weird that I'm wondering what's going on...
>
> Background is we have a pool (rep=3,min=2) where for each pg we select 1
> osd from a server with only nvme-osds, and 2 osds from servers with only
> hdd's. There are a total of 9 servers, with 3 (1 nvme + 2 hdd) in 3
> separate data centers. We always select servers from different data centers
> (latency is not an issue), so we would select for instance dc2:nvme,
> dc1.hdd, dc3:hdd, in 3 separate permutations.
>
> Here is the relevant part of our crushmap. I will explain layout and my
> fix (that I have no idea why I'm doing) below it:
>
> hostgroup hg1-1 {
>         id -30          # do not change unnecessarily
>         id -28 class nvme               # do not change unnecessarily
>         id -54 class hdd                # do not change unnecessarily
>         id -71 class ssd                # do not change unnecessarily
>         # weight 2.911
>         alg straw2
>         hash 0  # rjenkins1
>         item storage11 weight 2.911
> }
> hostgroup hg1-2 {
>         id -31          # do not change unnecessarily
>         id -29 class nvme               # do not change unnecessarily
>         id -55 class hdd                # do not change unnecessarily
>         id -73 class ssd                # do not change unnecessarily
>         # weight 65.789
>         alg straw2
>         hash 0  # rjenkins1
>         item storage22 weight 65.789
> }
> hostgroup hg1-3 {
>         id -32          # do not change unnecessarily
>         id -43 class nvme               # do not change unnecessarily
>         id -56 class hdd                # do not change unnecessarily
>         id -75 class ssd                # do not change unnecessarily
>         # weight 65.789
>         alg straw2
>         hash 0  # rjenkins1
>         item storage23 weight 65.789
> }
> hostgroup hg2-1 {
>         id -33          # do not change unnecessarily
>         id -45 class nvme               # do not change unnecessarily
>         id -58 class hdd                # do not change unnecessarily
>         id -78 class ssd                # do not change unnecessarily
>         # weight 2.911
>         alg straw2
>         hash 0  # rjenkins1
>         item storage12 weight 2.911
> }
> hostgroup hg2-2 {
>         id -34          # do not change unnecessarily
>         id -46 class nvme               # do not change unnecessarily
>         id -59 class hdd                # do not change unnecessarily
>         id -80 class ssd                # do not change unnecessarily
>         # weight 65.496
>         alg straw2
>         hash 0  # rjenkins1
>         item storage21 weight 65.496
> }
> hostgroup hg2-3 {
>         id -35          # do not change unnecessarily
>         id -47 class nvme               # do not change unnecessarily
>         id -60 class hdd                # do not change unnecessarily
>         id -81 class ssd                # do not change unnecessarily
>         # weight 65.789
>         alg straw2
>         hash 0  # rjenkins1
>         item storage23 weight 65.789
> }
> hostgroup hg3-1 {
>         id -36          # do not change unnecessarily
>         id -49 class nvme               # do not change unnecessarily
>         id -62 class hdd                # do not change unnecessarily
>         id -84 class ssd                # do not change unnecessarily
>         # weight 2.911
>         alg straw2
>         hash 0  # rjenkins1
>         item storage13 weight 2.911
> }
> hostgroup hg3-2 {
>         id -37          # do not change unnecessarily
>         id -50 class nvme               # do not change unnecessarily
>         id -63 class hdd                # do not change unnecessarily
>         id -85 class ssd                # do not change unnecessarily
>         # weight 65.496
>         alg straw2
>         hash 0  # rjenkins1
>         item storage21 weight 65.496
> }
> hostgroup hg3-3 {
>         id -38          # do not change unnecessarily
>         id -51 class nvme               # do not change unnecessarily
>         id -64 class hdd                # do not change unnecessarily
>         id -86 class ssd                # do not change unnecessarily
>         # weight 65.789
>         alg straw2
>         hash 0  # rjenkins1
>         item storage22 weight 65.789
> }
> datacenter ldc1 {
>         id -39          # do not change unnecessarily
>         id -44 class nvme               # do not change unnecessarily
>         id -57 class hdd                # do not change unnecessarily
>         id -76 class ssd                # do not change unnecessarily
>         # weight 134.489
>         alg straw2
>         hash 0  # rjenkins1
>         item hg1-1 weight 65.496
>         item hg1-2 weight 65.789
>         item hg1-3 weight 65.789
> }
> datacenter ldc2 {
>         id -40          # do not change unnecessarily
>         id -48 class nvme               # do not change unnecessarily
>         id -61 class hdd                # do not change unnecessarily
>         id -82 class ssd                # do not change unnecessarily
>         # weight 196.781
>         alg straw2
>         hash 0  # rjenkins1
>         item hg2-1 weight 65.496
>         item hg2-2 weight 65.496
>         item hg2-3 weight 65.789
> }
> datacenter ldc3 {
>         id -41          # do not change unnecessarily
>         id -52 class nvme               # do not change unnecessarily
>         id -65 class hdd                # do not change unnecessarily
>         id -87 class ssd                # do not change unnecessarily
>         # weight 197.197
>         alg straw2
>         hash 0  # rjenkins1
>         item hg3-1 weight 65.912
>         item hg3-2 weight 65.496
>         item hg3-3 weight 65.789
> }
> root ldc {
>         id -42          # do not change unnecessarily
>         id -53 class nvme               # do not change unnecessarily
>         id -66 class hdd                # do not change unnecessarily
>         id -88 class ssd                # do not change unnecessarily
>
>         # weight 528.881
>         alg straw2
>         hash 0  # rjenkins1
>         item ldc1 weight 97.489
>         item ldc2 weight 97.196
>         item ldc3 weight 97.196
> }
>
> # rules
> rule hybrid {
>         id 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take ldc
>         step choose firstn 1 type datacenter
>         step chooseleaf firstn 0 type hostgroup
>         step emit
> }
>
>
> Ok, so there are 9 hostgroups (I changed "type 2"). Each hostgroup
> currently holds 1 server, but may in the future hold more. These are
> grouped in 3, and called a "datacenter" even though the set is spread out
> onto 3 physical data centers. These are then put in a separate root called
> "ldc".
>
> The "hybrid" rule then proceeds to select 1 datacenter, and then 3 osds
> from that datacenter. The end result is that 3 OSDs from different physical
> datacenters are selected, with 1 nvme and 2 hdd (hdds have reduced primary
> affinity to 0.00099, and yes this might be a problem?). If one datacenter
> is lost, only 1/3 of the nvmes are in fact offline so capacity loss is
> manageable compared to having all nvme's in one datacenter.
>
> Because nvmes are much smaller, after adding one the "datacenter" looks
> like this:
>
>         item hg1-1 weight 2.911
>         item hg1-2 weight 65.789
>         item hg1-3 weight 65.789
>
> This causes PGs to go into "active+clean+remapped" state forever. If I
> manually change the weights so that they are all almost the same, the
> problem goes away! I would have thought that the weights do not matter,
> since we have to choose 3 of these anyways. So I'm really confused over
> this.
>
> Today I also had to change
>
>         item ldc1 weight 197.489
>         item ldc2 weight 197.196
>         item ldc3 weight 197.196
> to
>         item ldc1 weight 97.489
>         item ldc2 weight 97.196
>         item ldc3 weight 97.196
>
> or some PGs wouldn't activate at all! I'm really not aware how the
> hashing/selection process works though, it does somehow seem that if the
> values are too far apart, things seem to break. crushtool --test seems to
> correctly calculate my PGs.
>
> Basically when this happens I just randomly change some weights and most
> of the time it starts working. Why?
>
> Regards,
> Peter
>
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341
Mobile: +27 79 5237105


Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool
