Can you please advise how to fix this (manually)?
My cluster has not been healthy for 14 days now.
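
Following the advice quoted below, my first step would be to identify the inactive/peering PGs. My understanding is that they can be listed roughly like this (a sketch assuming the standard Nautilus CLI; <pgid> is just a placeholder for one of the PG ids reported in the output):

root@ld3955:~# ceph health detail
root@ld3955:~# ceph pg dump_stuck inactive
root@ld3955:~# ceph pg dump pgs_brief | grep peering
root@ld3955:~# ceph pg <pgid> query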
On 24.09.2019 at 13:35, Burkhard Linke wrote:
> Hi,
>
>
> you need to fix the non-active PGs first. They are probably also the
> reason for the blocked requests.
>
>
> Regards,
>
> Burkhard
>
>
> On 9/24/19 1:30 PM, Thomas wrote:
>> Hi,
>> ceph health reports
>> 1 MDSs report slow metadata IOs
>> 1 MDSs report slow requests
>>
>> This is the complete output of ceph -s:
>> root@ld3955:~# ceph -s
>> cluster:
>> id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>> health: HEALTH_ERR
>> 1 MDSs report slow metadata IOs
>> 1 MDSs report slow requests
>> 72 nearfull osd(s)
>> 1 pool(s) nearfull
>> Reduced data availability: 33 pgs inactive, 32 pgs peering
>> Degraded data redundancy: 123285/153918525 objects degraded (0.080%), 27 pgs degraded, 27 pgs undersized
>> Degraded data redundancy (low space): 116 pgs backfill_toofull
>> 3 pools have too many placement groups
>> 54 slow requests are blocked > 32 sec
>> 179 stuck requests are blocked > 4096 sec
>>
>> services:
>> mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 21h)
>> mgr: ld5507(active, since 21h), standbys: ld5506, ld5505
>> mds: pve_cephfs:1 {0=ld3955=up:active} 1 up:standby
>> osd: 368 osds: 368 up, 368 in; 140 remapped pgs
>>
>> data:
>> pools: 6 pools, 8872 pgs
>> objects: 51.31M objects, 196 TiB
>> usage: 591 TiB used, 561 TiB / 1.1 PiB avail
>> pgs: 0.372% pgs not active
>> 123285/153918525 objects degraded (0.080%)
>> 621911/153918525 objects misplaced (0.404%)
>> 8714 active+clean
>> 90 active+remapped+backfill_toofull
>> 26 active+undersized+degraded+remapped+backfill_toofull
>> 16 peering
>> 16 remapped+peering
>> 7 active+remapped+backfill_wait
>> 1 activating
>> 1 active+recovery_wait+degraded
>> 1 active+recovery_wait+undersized+remapped
>>
>> In the log I find these relevant entries:
>> 2019-09-24 13:24:37.073695 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18618.873983 secs
>> 2019-09-24 13:24:42.073757 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18623.874055 secs
>> 2019-09-24 13:24:47.073852 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18628.874149 secs
>> 2019-09-24 13:24:52.073941 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18633.874237 secs
>> 2019-09-24 13:24:57.074073 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18638.874354 secs
>> 2019-09-24 13:25:02.074118 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18643.874415 secs
>>
>> CephFS resides on the pool "hdd", which is backed by dedicated HDDs (4x 17 1.6TB).
>> This pool is also used for RBDs.
>>
>> Questions:
>> How can I identify the 2 slow requests?
>> And how can I kill them?
>>
>> Regards
>> Thomas
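Regarding the two slow MDS requests mentioned in my message quoted above: my understanding is that they can at least be inspected via the admin socket on the node running the active MDS (ld3955, per the status output), roughly like this (a sketch only; it lists the blocked operations and the OSD requests the MDS is waiting on, it does not resolve them):

root@ld3955:~# ceph daemon mds.ld3955 dump_ops_in_flight
root@ld3955:~# ceph daemon mds.ld3955 objecter_requests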
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]