The OSDs are up and in, but I still have the problem on the PGs, as you can see below:
root@ceph-mon1:~# ceph -s
  cluster:
    id:     43f5d6b4-74b0-4281-92ab-940829d3ee5e
    health: HEALTH_ERR
            1/3 mons down, quorum ceph-mon1,ceph-mon3
            14/32863 objects unfound (0.043%)
            Possible data damage: 13 pgs recovery_unfound
            Degraded data redundancy: 42/98589 objects degraded (0.043%), 9 pgs degraded
            5 daemons have recently crashed
            1 slow ops, oldest one blocked for 22521 sec, osd.7 has slow ops

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 41m), out of quorum: ceph-mon4
    mgr: ceph-mon1(active, since 30h), standbys: ceph-mon2
    osd: 12 osds: 12 up (since 75m), 12 in (since 30h); 10 remapped pgs

  data:
    pools:   5 pools, 225 pgs
    objects: 32.86k objects, 129 GiB
    usage:   384 GiB used, 4.3 TiB / 4.7 TiB avail
    pgs:     42/98589 objects degraded (0.043%)
             1811/98589 objects misplaced (1.837%)
             14/32863 objects unfound (0.043%)
             212 active+clean
             6   active+recovery_unfound+degraded+remapped
             4   active+recovery_unfound+remapped
             3   active+recovery_unfound+degraded

  io:
    client: 34 KiB/s rd, 41 op/s rd, 0 op/s wr
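
To dig further I am going to query the affected PGs one by one, as suggested earlier in the thread; roughly this, taking pg 5.77 from the health detail as an example (the same applies to the other PGs):

ceph pg 5.77 query          # the recovery_state section shows which OSDs the PG still wants to probe
ceph pg 5.77 list_unfound   # lists the unfound objects in that PG
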
On Fri, Oct 29, 2021 at 3:10 PM Etienne Menguy <[email protected]>
wrote:
> Could your hardware be faulty?
>
> Are you trying to deploy the faulty monitor? Or a whole new cluster?
>
> If you are trying to fix your cluster, you should focus on the OSDs.
> A cluster can run without big trouble with 2 monitors for a few days (if
> not years…).
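>
> The “Structure needs cleaning” error below is errno 117 (EUCLEAN), which comes from the filesystem rather than from Ceph, so the disk or filesystem holding /var/lib/ceph on ceph-mon4 may well be damaged. A quick check could look like this (the device name is only a placeholder, adapt it to your layout):
>
> dmesg -T | grep -i -e 'i/o error' -e ext4 -e xfs   # kernel messages about the disk/filesystem
> smartctl -a /dev/sdX                               # drive health; /dev/sdX is a placeholder
> # an actual repair (e2fsck/xfs_repair) has to run on an unmounted filesystem, e.g. from a rescue boot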
>
> -
> Etienne Menguy
> [email protected]
>
>
>
>
> On 29 Oct 2021, at 14:08, Michel Niyoyita <[email protected]> wrote:
>
> Hello team
>
> Below is the error I am getting when I try to redeploy the same cluster:
>
> TASK [ceph-mon : recursively fix ownership of monitor directory] ******************************************************************************************************************************************************
> Friday 29 October 2021  12:07:18 +0000 (0:00:00.411)       0:01:41.157 ********
> ok: [ceph-mon1]
> ok: [ceph-mon2]
> ok: [ceph-mon3]
> An exception occurred during task execution. To see the full traceback, use -vvv. The error was: OSError: [Errno 117] Structure needs cleaning: b'/var/lib/ceph/mon/ceph-ceph-mon4/store.db/216815.log'
> fatal: [ceph-mon4]: FAILED! => changed=false
>   module_stderr: |-
>     Traceback (most recent call last):
>       File "<stdin>", line 102, in <module>
>       File "<stdin>", line 94, in _ansiballz_main
>       File "<stdin>", line 40, in invoke_module
>       File "/usr/lib/python3.8/runpy.py", line 207, in run_module
>         return _run_module_code(code, init_globals, run_name, mod_spec)
>       File "/usr/lib/python3.8/runpy.py", line 97, in _run_module_code
>         _run_code(code, mod_globals, init_globals,
>       File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
>         exec(code, run_globals)
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 940, in <module>
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 926, in main
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 665, in ensure_directory
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/modules/files/file.py", line 340, in recursive_set_attributes
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 1335, in set_fs_attributes_if_different
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 988, in set_owner_if_different
>       File "/tmp/ansible_file_payload_uucq1oqa/ansible_file_payload.zip/ansible/module_utils/basic.py", line 883, in user_and_group
>     OSError: [Errno 117] Structure needs cleaning: b'/var/lib/ceph/mon/ceph-ceph-mon4/store.db/216815.log'
>   module_stdout: ''
>   msg: |-
>     MODULE FAILURE
>     See stdout/stderr for the exact error
>   rc: 1
>
>
>
> On Fri, Oct 29, 2021 at 12:37 PM Etienne Menguy <[email protected]>
> wrote:
>
>> Have you tried to restart one of the OSDs that seem to block PG recovery?
>>
>> I don’t think increasing the PG count will help.
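>>
>> For example for osd.7 (the one reporting slow ops), something like this on the node that hosts it, assuming a non-containerized systemd deployment:
>>
>> ceph daemon osd.7 dump_ops_in_flight   # run locally on the OSD node; shows what the slow requests are stuck on
>> systemctl restart ceph-osd@7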
>> -
>> Etienne Menguy
>> [email protected]
>>
>>
>>
>>
>> On 29 Oct 2021, at 11:53, Michel Niyoyita <[email protected]> wrote:
>>
>> Hello Eugen
>>
>> The failure_domain is host level and the crush rule is replicated_rule. In the
>> troubleshooting process I changed the PG count of pool 5 from 32 to 128 to see
>> if that would change anything, and it has the default replica count (3).
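>>
>> (For reference, the change was done with something like the following, with the real pool name in place of the placeholder:)
>>
>> ceph osd pool set <pool-name> pg_num 128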
>>
>> Thanks for your continuous help
>>
>> On Fri, Oct 29, 2021 at 11:44 AM Etienne Menguy <[email protected]>
>> wrote:
>>
>>> Is there a way to force the mon to rejoin the quorum? I tried to
>>> restart it but nothing changed. I guess it is the cause, if I am not
>>> mistaken.
>>>
>>>
>>> No, but with quorum_status you can check the monitor status and whether it’s
>>> trying to join the quorum.
>>> You may have to use the daemon socket interface (asok file) to get info
>>> directly from this monitor.
>>>
>>>
>>> https://docs.ceph.com/en/latest/rados/operations/monitoring/#checking-monitor-status
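>>>
>>> For example (the monitor name is taken from your ceph -s output):
>>>
>>> ceph quorum_status --format json-pretty
>>> # and on ceph-mon4 itself, through its admin socket:
>>> ceph daemon mon.ceph-mon4 mon_status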
>>>
>>> Which OSDs were down? As written by Eugen, having the crush rule and
>>> failure domain would be useful. It’s unusual that a single host failure triggers
>>> this issue.
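>>>
>>> Something like the following would show that (the rule name is assumed to be the default replicated_rule):
>>>
>>> ceph osd tree
>>> ceph osd pool ls detail
>>> ceph osd crush rule dump replicated_rule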
>>>
>>> I guess it is the cause If I am not mistaken
>>>
>>> I don’t think the monitor issue is the root cause of the unfound objects.
>>> You could easily delete the monitor and deploy it again to “fix” your monitor
>>> quorum.
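>>>
>>> Roughly (assuming the monitor is ceph-mon4 as in your output and that you redeploy it with your Ansible playbook):
>>>
>>> ceph mon remove ceph-mon4          # drop it from the monmap
>>> # then on ceph-mon4 itself: stop the daemon and move the old store out of the way before redeploying
>>> systemctl stop ceph-mon@ceph-mon4
>>> mv /var/lib/ceph/mon/ceph-ceph-mon4 /var/lib/ceph/mon/ceph-ceph-mon4.old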
>>>
>>>
>>> -
>>> Etienne Menguy
>>> [email protected]
>>>
>>>
>>>
>>>
>>> On 29 Oct 2021, at 11:30, Michel Niyoyita <[email protected]> wrote:
>>>
>>> Dear Etienne
>>>
>>> Is there a way to force the mon to rejoin the quorum? I tried to
>>> restart it but nothing changed. I guess it is the cause, if I am not
>>> mistaken.
>>>
>>> Below is the pg query output
>>>
>>>
>>> Regards
>>>
>>> On Fri, Oct 29, 2021 at 10:56 AM Etienne Menguy <[email protected]>
>>> wrote:
>>>
>>>> With “ceph pg x.y query” you can check why it’s complaining.
>>>>
>>>> x.y for pg id, like 5.77
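>>>>
>>>> For the unfound objects the interesting part is the recovery_state section of the query output, e.g. (assuming jq is available):
>>>>
>>>> ceph pg 5.77 query | jq '.recovery_state'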
>>>>
>>>> It would also be interesting to check why the mon fails to rejoin the quorum;
>>>> it may give you hints about your OSD issues.
>>>>
>>>> -
>>>> Etienne Menguy
>>>> [email protected]
>>>>
>>>>
>>>>
>>>>
>>>> On 29 Oct 2021, at 10:34, Michel Niyoyita <[email protected]> wrote:
>>>>
>>>> Hello Etienne
>>>>
>>>> This is the ceph -s output
>>>>
>>>> root@ceph-mon1:~# ceph -s
>>>>   cluster:
>>>>     id:     43f5d6b4-74b0-4281-92ab-940829d3ee5e
>>>>     health: HEALTH_ERR
>>>>             1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>>             14/47681 objects unfound (0.029%)
>>>>             1 scrub errors
>>>>             Possible data damage: 13 pgs recovery_unfound, 1 pg inconsistent
>>>>             Degraded data redundancy: 42/143043 objects degraded (0.029%), 13 pgs degraded
>>>>             2 slow ops, oldest one blocked for 2897 sec, daemons [osd.0,osd.7] have slow ops.
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 2h), out of quorum: ceph-mon4
>>>>     mgr: ceph-mon1(active, since 25h), standbys: ceph-mon2
>>>>     osd: 12 osds: 12 up (since 97m), 12 in (since 25h); 10 remapped pgs
>>>>
>>>>   data:
>>>>     pools:   5 pools, 225 pgs
>>>>     objects: 47.68k objects, 204 GiB
>>>>     usage:   603 GiB used, 4.1 TiB / 4.7 TiB avail
>>>>     pgs:     42/143043 objects degraded (0.029%)
>>>>              2460/143043 objects misplaced (1.720%)
>>>>              14/47681 objects unfound (0.029%)
>>>>              211 active+clean
>>>>              10  active+recovery_unfound+degraded+remapped
>>>>              3   active+recovery_unfound+degraded
>>>>              1   active+clean+inconsistent
>>>>
>>>>   io:
>>>>     client: 2.0 KiB/s rd, 88 KiB/s wr, 2 op/s rd, 12 op/s wr
>>>>
>>>> On Fri, Oct 29, 2021 at 10:09 AM Etienne Menguy <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Please share “ceph -s” output.
>>>>>
>>>>> -
>>>>> Etienne Menguy
>>>>> [email protected]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 29 Oct 2021, at 10:03, Michel Niyoyita <[email protected]> wrote:
>>>>>
>>>>> Hello team
>>>>>
>>>>> I am running a Ceph cluster with 3 monitors and 4 OSD nodes running 3 OSDs
>>>>> each. I deployed the cluster using Ansible, with Ubuntu 20.04 as the OS, and
>>>>> the Ceph version is Octopus. Yesterday the server which hosts the OSD nodes
>>>>> restarted because of a power issue, and since it came back one of the
>>>>> monitors is out of quorum and some PGs are marked as damaged. Please help me
>>>>> to solve this issue. Below is the health detail status I am seeing. The 4
>>>>> OSD nodes are the same machines that are running the monitors (3 of them).
>>>>>
>>>>> Best regards.
>>>>>
>>>>> Michel
>>>>>
>>>>>
>>>>> root@ceph-mon1:~# ceph health detail
>>>>> HEALTH_ERR 1/3 mons down, quorum ceph-mon1,ceph-mon3; 14/47195 objects unfound (0.030%); Possible data damage: 13 pgs recovery_unfound; Degraded data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded; 2 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow ops.
>>>>> [WRN] MON_DOWN: 1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>>>     mon.ceph-mon4 (rank 2) addr [v2:10.10.29.154:3300/0,v1:10.10.29.154:6789/0] is down (out of quorum)
>>>>> [WRN] OBJECT_UNFOUND: 14/47195 objects unfound (0.030%)
>>>>>     pg 5.77 has 1 unfound objects
>>>>>     pg 5.6d has 2 unfound objects
>>>>>     pg 5.6a has 1 unfound objects
>>>>>     pg 5.65 has 1 unfound objects
>>>>>     pg 5.4a has 1 unfound objects
>>>>>     pg 5.30 has 1 unfound objects
>>>>>     pg 5.28 has 1 unfound objects
>>>>>     pg 5.25 has 1 unfound objects
>>>>>     pg 5.19 has 1 unfound objects
>>>>>     pg 5.1a has 1 unfound objects
>>>>>     pg 5.1 has 1 unfound objects
>>>>>     pg 5.b has 1 unfound objects
>>>>>     pg 5.8 has 1 unfound objects
>>>>> [ERR] PG_DAMAGED: Possible data damage: 13 pgs recovery_unfound
>>>>>     pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>>>>     pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>>     pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>>>>     pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>>>>     pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>>>     pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>>     pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>>     pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>>>>     pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>>     pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>>     pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>>     pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>>>>     pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>>>>> [WRN] PG_DEGRADED: Degraded data redundancy: 42/141585 objects degraded (0.030%), 13 pgs degraded
>>>>>     pg 5.1 is active+recovery_unfound+degraded+remapped, acting [5,8,7], 1 unfound
>>>>>     pg 5.8 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>>     pg 5.b is active+recovery_unfound+degraded+remapped, acting [7,0,5], 1 unfound
>>>>>     pg 5.19 is active+recovery_unfound+degraded+remapped, acting [0,5,7], 1 unfound
>>>>>     pg 5.1a is active+recovery_unfound+degraded, acting [10,11,8], 1 unfound
>>>>>     pg 5.25 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>>     pg 5.28 is active+recovery_unfound+degraded+remapped, acting [6,11,8], 1 unfound
>>>>>     pg 5.30 is active+recovery_unfound+degraded+remapped, acting [7,5,0], 1 unfound
>>>>>     pg 5.4a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>>     pg 5.65 is active+recovery_unfound+degraded+remapped, acting [0,10,11], 1 unfound
>>>>>     pg 5.6a is active+recovery_unfound+degraded, acting [0,11,7], 1 unfound
>>>>>     pg 5.6d is active+recovery_unfound+degraded+remapped, acting [7,2,0], 2 unfound
>>>>>     pg 5.77 is active+recovery_unfound+degraded+remapped, acting [5,6,8], 1 unfound
>>>>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.7] have slow ops.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]