Hi all,

I have a large distributed Ceph cluster that recently broke: after a run of the Ceph Ansible playbook (which was being used to expand the cluster at a third site), all of the PGs housed at one site were marked as 'unknown'. Is there a way to recover the location of PGs in this state, or a way to fall back to a previous config where things were working? Or a way to scan the OSDs directly to determine which PGs are housed on them (there's a sketch of what I had in mind below)? All of the OSDs are still in place and reporting as healthy; it's only the PG locations that are missing.

For info: the cluster is used to provide a single shared CephFS mount for a distributed batch cluster, and it includes workers and pools of OSDs from three different OpenStack clouds.
Ceph version: 13.2.8
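
On the OSD-scanning question: what I had in mind was something like the commands below, run on an OSD host at the affected site. This is only a sketch I haven't tried yet -- the OSD id (123 here) is just a placeholder, and my understanding is that the OSD has to be stopped before ceph-objectstore-tool can open its data path:

    systemctl stop ceph-osd@123
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 --op list-pgs
    systemctl start ceph-osd@123

Is that a sane way to enumerate the PGs each OSD holds, or is there something less invasive?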
Here is the system health:
[root@euclid-edi-ctrl-0 ~]# ceph -s
  cluster:
    id:     0fe7e967-ecd6-46d4-9f6b-224539073d3b
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1 MDSs report slow metadata IOs
            Reduced data availability: 1024 pgs inactive
            6 slow ops, oldest one blocked for 244669 sec, mon.euclid-edi-ctrl-0 has slow ops
            too few PGs per OSD (26 < min 30)

  services:
    mon: 4 daemons, quorum euclid-edi-ctrl-0,euclid-cam-proxy-0,euclid-imp-proxy-0,euclid-ral-proxy-0
    mgr: euclid-edi-ctrl-0(active), standbys: euclid-imp-proxy-0, euclid-cam-proxy-0, euclid-ral-proxy-0
    mds: cephfs-2/2/2 up {0=euclid-ral-proxy-0=up:active,1=euclid-cam-proxy-0=up:active}
    osd: 269 osds: 269 up, 269 in

  data:
    pools:   5 pools, 5120 pgs
    objects: 30.54 M objects, 771 GiB
    usage:   3.8 TiB used, 41 TiB / 45 TiB avail
    pgs:     20.000% pgs unknown
             4095 active+clean
             1024 unknown
             1    active+clean+scrubbing
OSD Pools:
[root@euclid-edi-ctrl-0 ~]# ceph osd lspools
1 cephfs_data
2 cephfs_metadata
3 euclid_cam
4 euclid_ral
5 euclid_imp
[root@euclid-edi-ctrl-0 ~]# ceph pg dump_pools_json
dumped pools
POOLID OBJECTS  MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES        OMAP_BYTES* OMAP_KEYS* LOG     DISK_LOG
5      0        0                  0        0         0       0            0           0          0       0
1      16975540 0                  0        0         0       79165311663  0           0          6243475 6243475
2      5171099  0                  0        0         0       551991405    126879876   270829     3122183 3122183
3      8393436  0                  0        0         0       748466429315 0           0          1556647 1556647
4      0        0                  0        0         0       0            0           0          0       0
[root@euclid-edi-ctrl-0 ~]# ceph health detail
...
PG_AVAILABILITY Reduced data availability: 1024 pgs inactive
    pg 4.3c8 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3ca is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3cb is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d0 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d1 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d2 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d3 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d4 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d5 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d6 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d7 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d8 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d9 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3da is stuck inactive for 246794.767182, current state unknown, last acting []
...
[root@euclid-edi-ctrl-0 ~]# ceph pg map 4.3c8
osdmap e284992 pg 4.3c8 (4.3c8) -> up [] acting []
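
All of the stuck PGs shown in the health detail are in pool 4 (euclid_ral, per the lspools output above), which also shows 0 objects in the pool dump, so I suspect the playbook run changed the CRUSH map or rules in a way that leaves nothing mapped for that pool. To check that (and, with luck, to fall back to the previous config) I was planning to do something along these lines -- again just a sketch, with crush.bin/crush.txt as scratch file names:

    ceph osd pool get euclid_ral crush_rule
    ceph osd crush rule dump
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

and then, if I can recover a copy of the old map, restore it with 'ceph osd setcrushmap -i <old map>'. Does that sound like a reasonable way to roll back, or is there a better approach?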
Cheers,
Mark