> Is there a way to force the mon to rejoin the quorum? I tried to restart it 
> but nothing changed. I guess it is the cause, if I am not mistaken.


No, but with quorum_status you can check the monitor's status and whether it is 
trying to join the quorum.
You may have to use the daemon socket interface (asok file) to get info directly 
from this monitor.

https://docs.ceph.com/en/latest/rados/operations/monitoring/#checking-monitor-status
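
For example, a minimal sketch assuming the default admin socket path and that 
the out-of-quorum monitor is named ceph-mon4, as in your health output:

    # from any node with an admin keyring
    ceph quorum_status --format json-pretty

    # on the host running the out-of-quorum monitor, through its admin socket
    ceph daemon mon.ceph-mon4 mon_status
    # or by pointing at the socket file directly
    ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon4.asok mon_status

The "state" field in mon_status tells you whether the monitor is probing, 
electing, synchronizing or already in quorum.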

Which OSDs were down? As Eugen wrote, having your crush rule and failure domain 
would be useful. It's unusual that a single host failure triggers this issue.
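
For the crush rule and failure domain, something like this (just a sketch using 
standard commands, nothing specific to your setup) would show it:

    ceph osd tree                # hosts, OSDs and their up/in state
    ceph osd pool ls detail      # which crush rule each pool uses
    ceph osd crush rule dump     # the rules themselves, including the failure domain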

>  I guess it is the cause, if I am not mistaken
I don't think the monitor issue is the root cause of the unfound objects. You 
could easily delete the monitor and deploy it again to "fix" your monitor quorum.
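
Roughly, as a sketch only and assuming a package-based (non-containerized) 
deployment, which is what ceph-ansible usually gives you; the data directory 
below is the default path, double check it and 
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/ before running 
anything:

    # on the failed monitor host
    systemctl stop ceph-mon@ceph-mon4
    # from a node with an admin keyring, drop it from the monmap
    ceph mon remove ceph-mon4
    # wipe its store (default path for cluster "ceph" and mon id "ceph-mon4")
    rm -rf /var/lib/ceph/mon/ceph-ceph-mon4
    # then redeploy the monitor with your usual tooling (ceph-ansible in your case)

But again, that only clears the MON_DOWN warning; it won't bring back the 
unfound objects.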


-
Etienne Menguy
[email protected]




> On 29 Oct 2021, at 11:30, Michel Niyoyita <[email protected]> wrote:
> 
> Dear Etienne 
> 
> Is there a way to force the mon to rejoin the quorum? I tried to restart it 
> but nothing changed. I guess it is the cause, if I am not mistaken.
> 
> below is the pg query output
> 
> 
> Regards 
> 
> On Fri, Oct 29, 2021 at 10:56 AM Etienne Menguy <[email protected]> wrote:
> With “ceph pg x.y query” you can check why it’s complaining.
> 
> x.y for pg id, like 5.77 
> 
> It would also be interesting to check why the mon fails to rejoin the quorum; 
> it may give you hints about your OSD issues.
> 
> -
> Etienne Menguy
> [email protected] <mailto:[email protected]>
> 
> 
> 
> 
>> On 29 Oct 2021, at 10:34, Michel Niyoyita <[email protected]> wrote:
>> 
>> Hello Etienne
>> 
>> This is the ceph -s output
>> 
>> root@ceph-mon1:~# ceph -s
>>   cluster:
>>     id:     43f5d6b4-74b0-4281-92ab-940829d3ee5e
>>     health: HEALTH_ERR
>>             1/3 mons down, quorum ceph-mon1,ceph-mon3
>>             14/47681 objects unfound (0.029%)
>>             1 scrub errors
>>             Possible data damage: 13 pgs recovery_unfound, 1 pg inconsistent
>>             Degraded data redundancy: 42/143043 objects degraded (0.029%), 
>> 13 pgs degraded
>>             2 slow ops, oldest one blocked for 2897 sec, daemons 
>> [osd.0,osd.7] have slow ops.
>> 
>>   services:
>>     mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 2h), out of quorum: 
>> ceph-mon4
>>     mgr: ceph-mon1(active, since 25h), standbys: ceph-mon2
>>     osd: 12 osds: 12 up (since 97m), 12 in (since 25h); 10 remapped pgs
>> 
>>   data:
>>     pools:   5 pools, 225 pgs
>>     objects: 47.68k objects, 204 GiB
>>     usage:   603 GiB used, 4.1 TiB / 4.7 TiB avail
>>     pgs:     42/143043 objects degraded (0.029%)
>>              2460/143043 objects misplaced (1.720%)
>>              14/47681 objects unfound (0.029%)
>>              211 active+clean
>>              10  active+recovery_unfound+degraded+remapped
>>              3   active+recovery_unfound+degraded
>>              1   active+clean+inconsistent
>> 
>>   io:
>>     client:   2.0 KiB/s rd, 88 KiB/s wr, 2 op/s rd, 12 op/s wr
>> 
>> On Fri, Oct 29, 2021 at 10:09 AM Etienne Menguy <[email protected]> wrote:
>> Hi,
>> 
>> Please share “ceph -s” output.
>> 
>> -
>> Etienne Menguy
>> [email protected] <mailto:[email protected]>
>> 
>> 
>> 
>> 
>>> On 29 Oct 2021, at 10:03, Michel Niyoyita <[email protected]> wrote:
>>> 
>>> Hello team
>>> 
>>> I am running a Ceph cluster with 3 monitors and 4 OSD nodes running 3 OSDs
>>> each. I deployed the cluster using ansible on Ubuntu 20.04; the Ceph version
>>> is Octopus. Yesterday my server which hosts the OSD nodes restarted because
>>> of a power issue, and since it came back one of the monitors is out of quorum
>>> and some PGs are marked as damaged. Please help me solve this issue. Below is
>>> the health detail status I am seeing. The 4 OSD nodes are the same hosts that
>>> run the monitors (3 of them).
>>> 
>>> Best regards.
>>> 
>>> Michel
>>> 
>>> 
>>> root@ceph-mon1:~# ceph health detail
>>> HEALTH_ERR 1/3 mons down, quorum ceph-mon1,ceph-mon3; 14/47195 objects
>>> unfound (0.030%); Possible data damage: 13 pgs recovery_unfound; Degraded
>>> data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded; 2
>>> slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow
>>> ops.
>>> [WRN] MON_DOWN: 1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>    mon.ceph-mon4 (rank 2) addr
>>> [v2:10.10.29.154:3300/0,v1:10.10.29.154:6789/0] is down (out of quorum)
>>> [WRN] OBJECT_UNFOUND: 14/47195 objects unfound (0.030%)
>>>    pg 5.77 has 1 unfound objects
>>>    pg 5.6d has 2 unfound objects
>>>    pg 5.6a has 1 unfound objects
>>>    pg 5.65 has 1 unfound objects
>>>    pg 5.4a has 1 unfound objects
>>>    pg 5.30 has 1 unfound objects
>>>    pg 5.28 has 1 unfound objects
>>>    pg 5.25 has 1 unfound objects
>>>    pg 5.19 has 1 unfound objects
>>>    pg 5.1a has 1 unfound objects
>>>    pg 5.1 has 1 unfound objects
>>>    pg 5.b has 1 unfound objects
>>>    pg 5.8 has 1 unfound objects
>>> [ERR] PG_DAMAGED: Possible data damage: 13 pgs recovery_unfound
>>>    pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1
>>> unfound
>>>    pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1
>>> unfound
>>>    pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1
>>> unfound
>>>    pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1
>>> unfound
>>>    pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>    pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11],
>>> 1 unfound
>>>    pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8],
>>> 1 unfound
>>>    pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1
>>> unfound
>>>    pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>    pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11],
>>> 1 unfound
>>>    pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>    pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2
>>> unfound
>>>    pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1
>>> unfound
>>> [WRN] PG_DEGRADED: Degraded data redundancy: 42/141585 objects degraded
>>> (0.030%), 13 pgs degraded
>>>    pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1
>>> unfound
>>>    pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1
>>> unfound
>>>    pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1
>>> unfound
>>>    pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1
>>> unfound
>>>    pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>    pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11],
>>> 1 unfound
>>>    pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8],
>>> 1 unfound
>>>    pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1
>>> unfound
>>>    pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>    pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11],
>>> 1 unfound
>>>    pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>    pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2
>>> unfound
>>>    pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1
>>> unfound
>>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 322 sec, daemons
>>> [osd.0,osd.7] have slow ops.
>> 
> 

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
