Hi Eugen,
You say I don't have to worry about changing pg_num manually. Makes sense. Does
this also apply to pg_num_max? Will the pg_autoscaler also change that
parameter if necessary?
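If it helps, this is roughly how I have been checking and changing pg_num_max by
hand (I'm assuming the usual pool get/set syntax here and using libvirt-pool just
as the example, so please correct me if there is a better way):

# show the current pg_num_max cap for the pool
ceph osd pool get libvirt-pool pg_num_max
# change the cap manually
ceph osd pool set libvirt-pool pg_num_max 128
# check what the autoscaler wants to do with pg_num
ceph osd pool autoscale-status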
Below is the output you requested:
root@hvs001:/# ceph -s
  cluster:
    id:     dd4b0610-b4d2-11ec-bb58-d1b32ae31585
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive
            Degraded data redundancy: 68 pgs undersized

  services:
    mon: 3 daemons, quorum hvs001,hvs002,hvs003 (age 43h)
    mgr: hvs001.baejuo(active, since 43h), standbys: hvs002.etijdk
    osd: 6 osds: 6 up (since 25h), 6 in (since 4h); 4 remapped pgs

  data:
    pools:   2 pools, 68 pgs
    objects: 2 objects, 705 KiB
    usage:   134 MiB used, 1.7 TiB / 1.7 TiB avail
    pgs:     94.118% pgs not active
             4/6 objects misplaced (66.667%)
             64 undersized+peered
             4 active+undersized+remapped

  progress:
    Global Recovery Event (0s)
      [............................]
root@hvs001:/# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 1.70193 root default
-3 0.00137 host hvs001
0 hdd 0.00069 osd.0 up 1.00000 1.00000
1 hdd 0.00069 osd.1 up 1.00000 1.00000
-5 1.69919 host hvs002
2 hdd 0.84959 osd.2 up 1.00000 1.00000
3 hdd 0.84959 osd.3 up 1.00000 1.00000
-7 0.00137 host hvs003
4 hdd 0.00069 osd.4 up 1.00000 1.00000
5 hdd 0.00069 osd.5 up 1.00000 1.00000
root@hvs001:/# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins
pg_num 4 pgp_num 2 pg_num_target 1 pgp_num_target 1 autoscale_mode on
last_change 76 lfor 0/0/64 flags hashpspool stripe_width 0 pg_num_max 32
pg_num_min 1 application mgr
pool 2 'libvirt-pool' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 64 pgp_num 1 pgp_num_target 64 autoscale_mode on last_change 73
lfor 0/0/73 flags hashpspool stripe_width 0 pg_num_max 128 application rbd
root@hvs001:/# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
root@hvs001:~# tail /var/log/syslog
Apr 7 11:21:12 hvs001 bash[2670]: debug 2022-04-07T11:21:12.719+0000
7f51a7008700 0 log_channel(cluster) log [DBG] : pgmap v79196: 68 pgs: 64
undersized+peered, 4 active+undersized+remapped; 705 KiB data, 134 MiB used,
1.7 TiB / 1.7 TiB avail; 4/6 objects misplaced (66.667%)
Apr 7 11:21:13 hvs001 bash[2673]: level=error ts=2022-04-07T11:21:13.360Z
caller=notify.go:372 component=dispatcher msg="Error on notify" err="Post
https://10.3.1.23:8443//api/prometheus_receiver: x509: cannot validate
certificate for 10.3.1.23 because it doesn't contain any IP SANs"
context_err="context deadline exceeded"
Apr 7 11:21:13 hvs001 bash[2673]: level=error ts=2022-04-07T11:21:13.360Z
caller=notify.go:372 component=dispatcher msg="Error on notify" err="Post
https://hvs002.cometal.be:8443/api/prometheus_receiver: x509: certificate is
valid for ceph-dashboard, not hvs002.cometal.be" context_err="context deadline
exceeded"
Apr 7 11:21:13 hvs001 bash[2673]: level=error ts=2022-04-07T11:21:13.361Z
caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed"
num_alerts=1 err="Post https://10.3.1.23:8443//api/prometheus_receiver: x509:
cannot validate certificate for 10.3.1.23 because it doesn't contain any IP
SANs; Post https://hvs002.cometal.be:8443/api/prometheus_receiver: x509:
certificate is valid for ceph-dashboard, not hvs002.cometal.be"
Apr 7 11:21:14 hvs001 bash[2668]: cluster 2022-04-07T11:21:12.722206+0000
mgr.hvs001.baejuo (mgr.64107) 79190 : cluster [DBG] pgmap v79196: 68 pgs: 64
undersized+peered, 4 active+undersized+remapped; 705 KiB data, 134 MiB used,
1.7 TiB / 1.7 TiB avail; 4/6 objects misplaced (66.667%)
Apr 7 11:21:14 hvs001 bash[2670]: debug 2022-04-07T11:21:14.719+0000
7f51a7008700 0 log_channel(cluster) log [DBG] : pgmap v79197: 68 pgs: 64
undersized+peered, 4 active+undersized+remapped; 705 KiB data, 134 MiB used,
1.7 TiB / 1.7 TiB avail; 4/6 objects misplaced (66.667%)
Apr 7 11:21:15 hvs001 bash[2668]: cluster 2022-04-07T11:21:14.723411+0000
mgr.hvs001.baejuo (mgr.64107) 79191 : cluster [DBG] pgmap v79197: 68 pgs: 64
undersized+peered, 4 active+undersized+remapped; 705 KiB data, 134 MiB used,
1.7 TiB / 1.7 TiB avail; 4/6 objects misplaced (66.667%)
Apr 7 11:21:16 hvs001 bash[2668]: debug 2022-04-07T11:21:16.199+0000
7fc12252e700 1 mon.hvs001@0(leader).osd e87 _set_new_cache_sizes
cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc:
322961408
Apr 7 11:21:16 hvs001 bash[2670]: ::ffff:10.3.1.23 - - [07/Apr/2022:11:21:16]
"GET /metrics HTTP/1.1" 200 166748 "" "Prometheus/2.18.1"
Apr 7 11:21:16 hvs001 bash[2670]: debug 2022-04-07T11:21:16.267+0000
7f514f9b2700 0 [prometheus INFO cherrypy.access.139987709758544]
::ffff:10.3.1.23 - - [07/Apr/2022:11:21:16] "GET /metrics HTTP/1.1" 200 166748
"" "Prometheus/2.18.1"
________________________________________
From: Eugen Block <[email protected]>
Sent: Thursday, 7 April 2022 12:49
To: [email protected]
Subject: [ceph-users] Re: Ceph status HEALT_WARN - pgs problems
Hi,
please add some more output, e.g.
ceph -s
ceph osd tree
ceph osd pool ls detail
ceph osd crush rule dump (of the used rulesets)
You have the pg_autoscaler enabled, so you don't need to deal with pg_num
manually.
Quoting Dominique Ramaekers <[email protected]>:
> Hi,
>
> My cluster is up and running. I saw a note in ceph status that 1 pg
> was undersized. I read about the number of PGs and the recommended
> value (OSDs*100/poolsize => 6*100/3 = 200). The pg_num should be
> raised carefully, so I raised it to 2 and ceph status was fine again.
> So I left it like it was.
>
> Then I created a new pool: libvirt-pool.
>
> Now ceph status is again showing a warning about the PGs. I raised
> pg_num_max of the libvirt-pool to 265 and pg_num to 128.
>
> Ceph status stays in warning.
> root@hvs001:/# ceph status
> ...
> health: HEALTH_WARN
> Reduced data availability: 64 pgs inactive
> Degraded data redundancy: 68 pgs undersized
> ...
> pgs: 94.118% pgs not active
> 4/6 objects misplaced (66.667%) - this has been there since the
> cluster was created -
> 64 undersized+peered
> 4 active+undersized+remapped
>
> I also get a progress entry 'Global Recovery Event (0s)' which only
> goes away with 'ceph progress clear'.
>
> My autoscale-status is the following:
> root@hvs001:/# ceph osd pool autoscale-status
> POOL          SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
> .mgr          576.5k                3.0  1743G         0.0000                                  1.0       1              on         False
> libvirt-pool       0                3.0  1743G         0.0000                                  1.0      64              on         False
>
> (It's a 3 node cluster with 2 OSD's per node.)
>
> The documentation doesn't help me much here. What should I do?
>
> Greetings,
>
> Dominique.
>
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]