[ceph-users] Re: cephadm bootstraps cluster with bad CRUSH map(?)

Anthony D'Atri Mon, 20 May 2024 10:14:17 -0700

> 
>>> This has left me with a single sad pg:
>>> [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
>>>    pg 1.0 is stuck inactive for 33m, current state unknown, last acting []
>>> 
>> .mgr pool perhaps.
> 
> I think so
> 
>>> ceph osd tree shows that CRUSH picked up my racks OK, eg.
>>> -3          45.11993  rack B4
>>> -2          45.11993      host moss-be1001
>>> 1    hdd    3.75999          osd.1             up   1.00000  1.00000
>> Please send the entire first 10 lines or so of `ceph osd tree`
> 
> root@moss-be1001:/# ceph osd tree
> ID  CLASS  WEIGHT     TYPE NAME             STATUS  REWEIGHT  PRI-AFF
> -7         176.11194  rack F3
> -6         176.11194      host moss-be1003
> 2    hdd    7.33800          osd.2             up   1.00000  1.00000
> 3    hdd    7.33800          osd.3             up   1.00000  1.00000
> 6    hdd    7.33800          osd.6             up   1.00000  1.00000
> 9    hdd    7.33800          osd.9             up   1.00000  1.00000
> 12    hdd    7.33800          osd.12            up   1.00000  1.00000
> 13    hdd    7.33800          osd.13            up   1.00000  1.00000
> 16    hdd    7.33800          osd.16            up   1.00000  1.00000
> 19    hdd    7.33800          osd.19            up   1.00000  1.00000


Yep.  Your racks and thus hosts and OSDs aren’t under the `default` or any 
other root, so they won’t get picked by any CRUSH rule. 
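
A quick way to spot this condition is to look for top-level buckets whose type is not `root`. The sketch below is only illustrative: it assumes `ceph osd tree -f json` emits a `nodes` list whose entries carry `id`, `name`, `type` and (for buckets) `children`, so check that against your release. The sample data is hand-written to mirror the tree above, and the empty `default` root in it is an assumption. `orphaned_buckets` is a hypothetical helper name.

```python
import json


def orphaned_buckets(tree):
    """Return (type, name) for top-level CRUSH buckets that are not roots."""
    nodes = tree["nodes"]
    # Any id listed as some bucket's child has a parent in the hierarchy.
    child_ids = {c for n in nodes for c in n.get("children", [])}
    return [
        (n["type"], n["name"])
        for n in nodes
        if n["id"] not in child_ids and n["type"] != "root"
    ]


# Hand-written sample (assumed shape), not real cluster output.
sample = json.loads("""
{"nodes": [
  {"id": -1, "name": "default", "type": "root", "children": []},
  {"id": -7, "name": "F3", "type": "rack", "children": [-6]},
  {"id": -6, "name": "moss-be1003", "type": "host", "children": [2, 3]},
  {"id": 2, "name": "osd.2", "type": "osd"},
  {"id": 3, "name": "osd.3", "type": "osd"}
]}
""")

print(orphaned_buckets(sample))
```

Any rack it reports sits outside every root, so a rule that starts with `step take default` can never reach the OSDs beneath it.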

> 
>>> 
>>> I passed this config to bootstrap with --config:
>>> 
>>> [global]
>>>  osd_crush_chooseleaf_type = 3
>> Why did you set that?  3 is an unusual value.  AIUI most of the time the 
>> only reason to change this option is if one is setting up a single-node 
>> sandbox - and perhaps localpools create a rule using it.  I suspect this is 
>> at least part of your problem.
> 
> I wanted to have rack as failure domain rather than host i.e. to ensure that 
> each replica goes in a different rack (academic at the moment as I have 3 
> hosts, one in each rack, but for future expansion important).

You do that with the CRUSH rule, not with osd_crush_chooseleaf_type.  Set that 
back to the default value of `1`.  This option is marked `dev` for a reason ;)

And the replication rule:
rule replicated_rule {
       id 0
       type replicated
       step take default
       step chooseleaf firstn 0 type rack    # `rack` here is what selects the failure domain.
       step emit
}
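
To make the effect of `type rack` concrete, here is a toy model of what that step guarantees: each replica lands in a different rack. This is emphatically not the real CRUSH algorithm (no weights, no straw2, no retries); it is a sketch that only shows the "distinct racks first, then one OSD per rack" shape. `pick_replicas` is a hypothetical name, and the third rack and most OSD ids in the example layout are made up.

```python
import hashlib


def pick_replicas(racks, pg, size):
    """Toy stand-in for `chooseleaf firstn 0 type rack`.

    Picks `size` distinct racks, then one OSD inside each chosen rack.
    """
    def score(*parts):
        # Deterministic pseudo-random ordering keyed on the PG id.
        return hashlib.sha256(":".join(map(str, parts)).encode()).hexdigest()

    # Rank racks per PG and keep the first `size` of them.
    chosen = sorted(racks, key=lambda r: score(pg, r))[:size]
    # Inside each chosen rack, pick a single OSD.
    return [(r, min(racks[r], key=lambda o: score(pg, r, o))) for r in chosen]


# Illustrative layout: B4 and F3 appear in the thread, X1 is hypothetical.
racks = {"B4": [1, 4, 7], "F3": [2, 3, 6], "X1": [0, 5, 8]}
placement = pick_replicas(racks, "1.0", 3)
print(placement)
```

With `type host` in the rule instead, the distinctness guarantee would apply to hosts, and two replicas could share a rack.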


> I could presumably fix this up by editing the crushmap (to put the racks into 
> the default bucket)

That would probably help 

        `ceph osd crush move F3 root=default`

but I think you’d also need to revert `osd_crush_chooseleaf_type`.  Might be 
better to wipe and redeploy, so you know this behavior won’t resurface down 
the road when you add or replace hardware.


> 
>>> Once the cluster was up I used an osd spec file that looked like:
>>> service_type: osd
>>> service_id: rrd_single_NVMe
>>> placement:
>>>  label: "NVMe"
>>> spec:
>>>  data_devices:
>>>    rotational: 1
>>>  db_devices:
>>>    model: "NVMe"
>> Is it your intent to use spinners for payload data and SSD for metadata?
> 
> Yes.

You might want to set `db_slots` accordingly; by default I think it’ll be 1:1, 
which probably isn’t what you intend.

> 
> Regards,
> 
> Matthew
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
