Hi Peter,

Just to check if your problem is similar to mine:

   - Do you have any pools that follow a crush rule to only use OSDs that
   are backed by HDDs (i.e. not NVMes)?
   - Do these pools actually obey that rule, i.e. do they maybe have PGs
   that sit on NVMes? (A quick way to check both is shown below.)
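
If it helps, each pool's rule is visible with stock commands, nothing
specific to any particular setup:

   ceph osd crush rule ls      # names of the crush rules
   ceph osd pool ls detail     # per pool: "... crush_rule <id> ..."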

Regards,
Tom

On Fri, Jan 26, 2018 at 11:48 AM, Peter Linder <[email protected]>
wrote:

> Hi Thomas,
>
> No, we haven't gotten any closer to resolving this. In fact we hit another
> issue when we added a new nvme drive to each of our nvme servers (storage11,
> storage12 and storage13); the new drives have weight 1.7 instead of the
> usual 0.728. Below is what an nvme and hdd server pair at one site looks
> like, and it broke when adding osd.10 (adding the nvme drive to storage12
> and storage13 worked, it failed when adding the last one to storage11).
> Changing osd.10's weight to 1.0 instead and recompiling the crushmap allowed
> all PGs to activate.
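>
> For reference, this is roughly the edit cycle we go through when adjusting
> the map by hand (a sketch; the file names are just placeholders):
>
>     ceph osd getcrushmap -o crushmap.bin          # dump the current map
>     crushtool -d crushmap.bin -o crushmap.txt     # decompile to text
>     # edit the weight for osd.10 in crushmap.txt
>     crushtool -c crushmap.txt -o crushmap.new     # recompile
>     ceph osd setcrushmap -i crushmap.new          # inject the edited map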
>
> Unfortunately this is a production cluster that we were hoping to expand
> as needed, so when there is a problem we have to revert quickly to the last
> working crushmap, which leaves no time to debug :(
>
> We are currently building a virtualized copy of the environment, and I hope
> we will be able to re-create the issue there, since we can break that one at
> will :)
>
>
> host storage11 {
>         id -5           # do not change unnecessarily
>         id -6 class nvme                # do not change unnecessarily
>         id -10 class hdd                # do not change unnecessarily
>         # weight 4.612
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.0 weight 0.728
>         item osd.3 weight 0.728
>         item osd.6 weight 0.728
>         item osd.7 weight 0.728
>         item osd.10 weight 1.700
> }
> host storage21 {
>         id -13          # do not change unnecessarily
>         id -14 class nvme               # do not change unnecessarily
>         id -15 class hdd                # do not change unnecessarily
>         # weight 65.496
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.12 weight 5.458
>         item osd.13 weight 5.458
>         item osd.14 weight 5.458
>         item osd.15 weight 5.458
>         item osd.16 weight 5.458
>         item osd.17 weight 5.458
>         item osd.18 weight 5.458
>         item osd.19 weight 5.458
>         item osd.20 weight 5.458
>         item osd.21 weight 5.458
>         item osd.22 weight 5.458
>         item osd.23 weight 5.458
> }
>
>
> Den 2018-01-26 kl. 08:45, skrev Thomas Bennett:
>
> Hi Peter,
>
> Not sure if you have got to the bottom of your problem,  but I seem to
> have found what might be a similar problem. I recommend reading below,  as
> there could be a potential hidden problem.
>
> Yesterday our cluster went into *HEALTH_WARN* state and I noticed that
> one of my pgs was listed as '*activating*' and marked as '*inactive*'
> and '*unclean*'.
>
> We also have a mixed OSD system (768 HDDs and 16 NVMEs) with three crush
> rules for object placement: the default *replicated_rule* (I never
> deleted it) and then two new ones, *replicate_rule_hdd* and
> *replicate_rule_nvme*.
>
> Running a query on the pg (in my case pg 15.792) did not yield anything
> out of place, except that it told me its state was '*activating*' (that's
> not even listed as a pg state: pg states
> <http://docs.ceph.com/docs/master/rados/operations/pg-states/>), which made
> me slightly alarmed.
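>
> (For reference, the query is just the standard command below; the state is
> reported near the top of its JSON output.)
>
>     ceph pg 15.792 query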
>
> The bits of information that alerted me to the issue were:
>
> 1. Running 'ceph pg dump' and finding the 'activating' pg showed the
> following information:
>
> 15.792 activating [4,724,242]   # pg in pool 15, mapped to osds 4, 724 and 242
>
>
> 2. Running 'ceph osd tree | grep "osd.4 "' and getting the following
> information:
>
> 4 nvme osd.4
>
> 3. Now checking what pool 15 is by running 'ceph osd pool ls detail':
>
> pool 15 'default.rgw.data' replicated size 3 min_size 2 crush_rule 1
>
>
> These three bits of information made me realise what was going on:
>
>    - OSDs 4, 724 and 242 are all nvmes
>    - Pool 15 should obey crush_rule 1 (*replicate_rule_hdd*)
>    - Pool 15 has pgs that sit on nvmes! (a quick check for this is sketched
>    below)
>
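> Putting those three checks together, something like the sketch below prints
> the tree entry, including device class, for every OSD a given pg maps to
> (it assumes jq is installed and that 'ceph pg map' reports the up set in an
> 'up' field on your release):
>
>     PG=15.792
>     for osd in $(ceph pg map $PG --format json | jq -r '.up[]'); do
>         ceph osd tree | grep " osd.$osd "
>     done
>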
> I found the following really useful tool online which showed me the depth
> of the problem: Get the Number of Placement Groups Per Osd
> <http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd>
>
> So it turns out that in my case pool 15 has pgs on all of the nvmes!
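>
> A rough shell equivalent of that check for a single pool is sketched below;
> it counts how many of pool 15's pgs land on each OSD (check the header line
> of 'ceph pg dump pgs_brief' first; I'm assuming the UP set is the third
> column):
>
>     ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^15\./ {print $3}' \
>         | tr -d '[]' | tr ',' '\n' | sort -n | uniq -c | sort -rn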
>
> To test a fix, and to be able to mimic the problem again, I executed the
> following command: 'ceph osd pg-upmap-items 15.792 4 22 724 67 76 242'
>
> It remapped the osds used by the 'activating' pg, my cluster status went
> back to *HEALTH_OK*, and the pg went back to normal, making the cluster
> appear healthy.
>
> Luckily for me we've not put the cluster into production so I'll just blow
> away the pool and recreate it.
>
> What I've not yet figured out is how this happened.
>
> The steps (I think) I took were:
>
>    1. Ran ceph-ansible, and the 'default.rgw.data' pool was created
>    automatically.
>    2. I think I then increased the pg count.
>    3. Create a new rule: ceph osd crush rule create-replicated
>    replicated_rule_hdd default host hdd
>    4. Move pool to new rule: ceph osd pool set default.rgw.data crush_rule
>    replicated_rule_hdd
>
> I don't know what the expected behaviour of the set command is, so I'm
> planning to recreate the problem on a test cluster and see which part of the
> process caused it. Perhaps I should have migrated to the new rule first,
> before increasing the pgs.
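>
> As a quick sanity check after step 4 (stock commands, nothing specific to
> our setup), the pool's current rule and pg count can be read back with:
>
>     ceph osd pool get default.rgw.data crush_rule
>     ceph osd pool get default.rgw.data pg_num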
>
> Regards,
> Tom
>
> On Sat, Jan 20, 2018 at 10:30 PM, <[email protected]> wrote:
>
>> Hi all,
>>
>> I'm getting such weird problems when we, for instance, re-add a server or
>> add disks! Most of the time some PGs end up in "active+clean+remapped"
>> state, but today some of them got stuck "activating", which meant that some
>> PGs were offline for a while. I'm able to fix things, but the fix is so
>> weird that I'm wondering what's going on...
>>
>> Background: we have a pool (rep=3, min=2) where for each pg we select 1
>> osd from a server with only nvme osds, and 2 osds from servers with only
>> hdds. There are a total of 9 servers, with 3 (1 nvme + 2 hdd) in each of 3
>> separate data centers. We always select servers from different data centers
>> (latency is not an issue), so we would select for instance dc2:nvme,
>> dc1:hdd, dc3:hdd, in 3 separate permutations.
>>
>> Here is the relevant part of our crushmap. I will explain the layout, and
>> my fix (which I have no idea why it works), below it:
>>
>> hostgroup hg1-1 {
>>         id -30          # do not change unnecessarily
>>         id -28 class nvme               # do not change unnecessarily
>>         id -54 class hdd                # do not change unnecessarily
>>         id -71 class ssd                # do not change unnecessarily
>>         # weight 2.911
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage11 weight 2.911
>> }
>> hostgroup hg1-2 {
>>         id -31          # do not change unnecessarily
>>         id -29 class nvme               # do not change unnecessarily
>>         id -55 class hdd                # do not change unnecessarily
>>         id -73 class ssd                # do not change unnecessarily
>>         # weight 65.789
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage22 weight 65.789
>> }
>> hostgroup hg1-3 {
>>         id -32          # do not change unnecessarily
>>         id -43 class nvme               # do not change unnecessarily
>>         id -56 class hdd                # do not change unnecessarily
>>         id -75 class ssd                # do not change unnecessarily
>>         # weight 65.789
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage23 weight 65.789
>> }
>> hostgroup hg2-1 {
>>         id -33          # do not change unnecessarily
>>         id -45 class nvme               # do not change unnecessarily
>>         id -58 class hdd                # do not change unnecessarily
>>         id -78 class ssd                # do not change unnecessarily
>>         # weight 2.911
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage12 weight 2.911
>> }
>> hostgroup hg2-2 {
>>         id -34          # do not change unnecessarily
>>         id -46 class nvme               # do not change unnecessarily
>>         id -59 class hdd                # do not change unnecessarily
>>         id -80 class ssd                # do not change unnecessarily
>>         # weight 65.496
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage21 weight 65.496
>> }
>> hostgroup hg2-3 {
>>         id -35          # do not change unnecessarily
>>         id -47 class nvme               # do not change unnecessarily
>>         id -60 class hdd                # do not change unnecessarily
>>         id -81 class ssd                # do not change unnecessarily
>>         # weight 65.789
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage23 weight 65.789
>> }
>> hostgroup hg3-1 {
>>         id -36          # do not change unnecessarily
>>         id -49 class nvme               # do not change unnecessarily
>>         id -62 class hdd                # do not change unnecessarily
>>         id -84 class ssd                # do not change unnecessarily
>>         # weight 2.911
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage13 weight 2.911
>> }
>> hostgroup hg3-2 {
>>         id -37          # do not change unnecessarily
>>         id -50 class nvme               # do not change unnecessarily
>>         id -63 class hdd                # do not change unnecessarily
>>         id -85 class ssd                # do not change unnecessarily
>>         # weight 65.496
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage21 weight 65.496
>> }
>> hostgroup hg3-3 {
>>         id -38          # do not change unnecessarily
>>         id -51 class nvme               # do not change unnecessarily
>>         id -64 class hdd                # do not change unnecessarily
>>         id -86 class ssd                # do not change unnecessarily
>>         # weight 65.789
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item storage22 weight 65.789
>> }
>> datacenter ldc1 {
>>         id -39          # do not change unnecessarily
>>         id -44 class nvme               # do not change unnecessarily
>>         id -57 class hdd                # do not change unnecessarily
>>         id -76 class ssd                # do not change unnecessarily
>>         # weight 134.489
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item hg1-1 weight 65.496
>>         item hg1-2 weight 65.789
>>         item hg1-3 weight 65.789
>> }
>> datacenter ldc2 {
>>         id -40          # do not change unnecessarily
>>         id -48 class nvme               # do not change unnecessarily
>>         id -61 class hdd                # do not change unnecessarily
>>         id -82 class ssd                # do not change unnecessarily
>>         # weight 196.781
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item hg2-1 weight 65.496
>>         item hg2-2 weight 65.496
>>         item hg2-3 weight 65.789
>> }
>> datacenter ldc3 {
>>         id -41          # do not change unnecessarily
>>         id -52 class nvme               # do not change unnecessarily
>>         id -65 class hdd                # do not change unnecessarily
>>         id -87 class ssd                # do not change unnecessarily
>>         # weight 197.197
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item hg3-1 weight 65.912
>>         item hg3-2 weight 65.496
>>         item hg3-3 weight 65.789
>> }
>> root ldc {
>>         id -42          # do not change unnecessarily
>>         id -53 class nvme               # do not change unnecessarily
>>         id -66 class hdd                # do not change unnecessarily
>>         id -88 class ssd                # do not change unnecessarily
>>
>>         # weight 528.881
>>         alg straw2
>>         hash 0  # rjenkins1
>>         item ldc1 weight 97.489
>>         item ldc2 weight 97.196
>>         item ldc3 weight 97.196
>> }
>>
>> # rules
>> rule hybrid {
>>         id 1
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take ldc
>>         step choose firstn 1 type datacenter
>>         step chooseleaf firstn 0 type hostgroup
>>         step emit
>> }
>>
>>
>> Ok, so there are 9 hostgroups (I changed "type 2"). Each hostgroup
>> currently holds 1 server, but may hold more in the future. These are
>> grouped in threes, and each group is called a "datacenter" even though the
>> set is spread out over 3 physical data centers. These are then put in a
>> separate root called "ldc".
>>
>> The "hybrid" rule then proceeds to select 1 datacenter, and then 3 osds
>> from that datacenter. The end result is that 3 OSDs from different physical
>> datacenters are selected, with 1 nvme and 2 hdd (hdds have reduced primary
>> affinity to 0.00099, and yes this might be a problem?). If one datacenter
>> is lost, only 1/3'rd of the nvmes are in fact offline so capacity loss is
>> manageable compared to having all nvme's in one datacenter.
>>
>> Because nvmes are much smaller, after adding one the "datacenter" looks
>> like this:
>>
>>         item hg1-1 weight 2.911
>>         item hg1-2 weight 65.789
>>         item hg1-3 weight 65.789
>>
>> This causes PGs to go into "active+clean+remapped" state forever. If I
>> manually change the weights so that they are all almost the same, the
>> problem goes away! I would have thought that the weights do not matter,
>> since we have to choose 3 of these anyway. So I'm really confused by
>> this.
>>
>> Today I also had to change
>>
>>         item ldc1 weight 197.489
>>         item ldc2 weight 197.196
>>         item ldc3 weight 197.196
>> to
>>         item ldc1 weight 97.489
>>         item ldc2 weight 97.196
>>         item ldc3 weight 97.196
>>
>> or some PGs wouldn't activate at all! I'm really not aware of how the
>> hashing/selection process works, though; it somehow seems that if the
>> weights are too far apart, things break. crushtool --test seems to
>> correctly calculate my PGs.
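>>
>> For reference, this is roughly how I run that test against the compiled
>> map (rule id 1 is the "hybrid" rule above; the file name and pg count are
>> just placeholders):
>>
>>     crushtool -i crushmap.bin --test --rule 1 --num-rep 3 \
>>         --min-x 0 --max-x 1023 --show-mappings
>>     # --show-bad-mappings instead lists only the inputs that got fewer
>>     # than 3 OSDs, which is what a stuck "activating" PG would look like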
>>
>> Basically when this happens I just randomly change some weights and most
>> of the time it starts working. Why?
>>
>> Regards,
>> Peter
>>
>>
>>
>>
>>
>>
>
>
>
> --
> Thomas Bennett
>
> SKA South Africa
> Science Processing Team
>
> Office: +27 21 5067341
> Mobile: +27 79 5237105
>
>
>


-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341
Mobile: +27 79 5237105
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
