Note: I am not entirely sure here, and would love other input from the ML about 
this, so take this with a grain of salt.

You don't show any unfound objects, which I think is excellent news as far as 
data loss goes.
>>            96   active+clean+scrubbing+deep+repair
The deep scrub + repair seems auspicious, though it also looks like a really 
heavy operation on those PGs.

I can't tell fully, but it looks like your EC profile is K+M=12, which could be 
10+2, 9+3, or (hopefully not) 11+1.
That said, being on Mimic, I am thinking that you are more than likely running 
into this: 
https://docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coded-pool-recovery
> Prior to Octopus, erasure coded pools required at least min_size shards to be 
> available, even if min_size is greater than K. (We generally recommend 
> min_size be K+2 or more to prevent loss of writes and data.) This 
> conservative decision was made out of an abundance of caution when designing 
> the new pool mode but also meant pools with lost OSDs but no data loss were 
> unable to recover and go active without manual intervention to change the 
> min_size.
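
If it helps to confirm the exact K+M split, you should be able to pull the 
profile off the pool itself (pool name taken from your ceph osd pool ls detail 
output below; the second command just takes whatever profile name the first one 
prints):
$ ceph osd pool get cephfs-data erasure_code_profile
$ ceph osd erasure-code-profile get <profile-name-from-above>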

I can't definitively say whether reducing min_size will unlock the offline 
data, but I think it could.
As for what that value should be, I would just drop it by one and see whether 
the PGs come out of their incomplete state.
After the (hopefully successful) recovery, I would revert min_size to its 
original value for safety.
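
Concretely, since your pool detail below shows cephfs-data at min_size 11, I 
would expect that to look something like the following, with the caveat that 
this is a temporary crutch and not a value to run at long term:
$ ceph osd pool set cephfs-data min_size 10
and then, once the PGs have (hopefully) recovered:
$ ceph osd pool set cephfs-data min_size 11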

Something odd I did notice from the pastebin of ceph health detail:
> pg 3.e5 is remapped+incomplete, acting 
> [2147483647,2147483647,2147483647,2147483647,2147483647,278,2147483647,2147483647,273,2147483647,2147483647,2147483647]
> pg 3.14e is remapped+incomplete, acting 
> [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,271,2147483647,222,416,2147483647]
> pg 3.45e is remapped+incomplete, acting 
> [2147483647,2147483647,2147483647,2147483647,2147483647,377,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> pg 3.4bc is remapped+incomplete, acting 
> [2147483647,280,2147483647,2147483647,2147483647,407,445,268,2147483647,2147483647,418,273]
> pg 3.7c6 is remapped+incomplete, acting 
> [2147483647,338,2147483647,2147483647,261,2147483647,2147483647,2147483647,416,415,337,2147483647]
> pg 3.8e8 is remapped+incomplete, acting 
> [2147483647,2147483647,2147483647,2147483647,360,418,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> pg 3.b5e is remapped+incomplete, acting 
> [2147483647,242,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,399,2147483647,2147483647]

These 7 PGs are reporting a really large percentage of chunks with no OSD found 
(2147483647 is how CRUSH renders ITEM_NONE, i.e. no OSD is mapped for that 
shard).
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#erasure-coded-pgs-are-not-active-clean
I think this could relate to the bit below about osd.73 throwing off the crush 
map, though someone with more experience may have a better read on what it 
implies.

As for osd.73, I would remove it from the crush map.
Its presence in the crush map, despite not being a valid OSD, may be throwing 
off the crush mappings.
The first step I would take would be:
$ ceph osd crush remove osd.73
$ ceph osd rm osd.73

This should reweight the ceph003 host, and cause some data movement.
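
And if osd.73 still has a cephx key registered (I can't tell from what you've 
posted), it shouldn't hurt to clean that up at the same time:
$ ceph auth del osd.73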

So, in summation:
I would kill off osd.73 first.
Then, after the rebalancing that should trigger, I would reduce min_size to try 
to bring the PGs out of their incomplete state.
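
To keep an eye on whether that is actually helping, something like the below 
should show whether the incomplete PG count is coming down:
$ ceph pg ls incomplete
$ watch ceph -s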

As I said, I'm not entirely sure, and would love a second opinion from someone, 
but if it were me in a vacuum, I think these would be my steps.

Reed

> On Jun 15, 2021, at 10:14 AM, Aly, Adel <adel....@atos.net> wrote:
> 
> Hi Reed,
> 
> Thank you for getting back to us.
> 
> We had indeed several disk failures at the same time.
> 
> Regarding the OSD map, we have an OSD that failed and we needed to remove but 
> we didn't update the crushmap.
> 
> The question here, is it safe to update the OSD crushmap without affecting 
> the data available?
> 
> We can free up more space on the monitors if that will help indeed.
> 
> More information which can be helpful:
> 
> # ceph -v
> ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
> 
> # ceph health detail
> https://pastebin.pl/view/2b8b337d
> 
> # ceph osd pool ls detail
> pool 3 'cephfs-data' erasure size 12 min_size 11 crush_rule 1 object_hash 
> rjenkins pg_num 3072 pgp_num 3072 last_change 370219 lfor 0/367599 flags 
> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 40960 fast_read 1 
> compression_algorithm snappy compression_mode force application cephfs
>        removed_snaps [2~7c]
> pool 4 'cephfs-meta' replicated size 3 min_size 2 crush_rule 0 object_hash 
> rjenkins pg_num 1024 pgp_num 1024 last_change 370219 lfor 0/367414 flags 
> hashpspool stripe_width 0 compression_algorithm none compression_mode none 
> application cephfs
> 
> # ceph osd tree
> https://pastebin.pl/view/eac56017
> 
> Our main struggle is when we try to rsync data, the rsync process hangs 
> because it encounters an inaccessible object.
> 
> Is there a way we can take out the incomplete PGs to be able to copy data 
> smoothly without having to reset the rsync process?
> 
> Kind regards,
> adel
> 
> -----Original Message-----
> From: Reed Dier <reed.d...@focusvq.com>
> Sent: Tuesday, June 15, 2021 4:21 PM
> To: Aly, Adel <adel....@atos.net>
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] ceph PGs issues
> 
> You have incomplete PGs, which means you have inactive data, because the data 
> isn't there.
> 
> This will typically only happen when you have multiple concurrent disk 
> failures, or something like that, so I think there is some missing info.
> 
>>           1 osds exist in the crush map but not in the osdmap
> 
> This seems like a red flag to have an OSD in the crush map but not the osdmap.
> 
>>           mons xyz01,xyz02 are low on available space
> 
> Your mons are probably filling up with data while running in the warn state.
> This can be problematic for recovery.
> 
> I think you will be more likely to receive some useful suggestions by 
> providing things like which version of ceph you are using ($ ceph -v), major 
> events that caused this, pool ($ ceph osd pool ls detail) and osd ($ ceph osd 
> tree) topology, as well as maybe detailed health output ($ ceph health 
> detail).
> 
> Given how much data some things may be, like the osd tree, you may want to 
> paste to pastebin and link here.
> 
> Reed
> 
>> On Jun 15, 2021, at 2:48 AM, Aly, Adel <adel....@atos.net> wrote:
>> 
>> Dears,
>> 
>> We have a ceph cluster with 4096 PGs, out of which 100+ PGs are not 
>> active+clean.
>> 
>> On top of the ceph cluster, we have a ceph FS, with 3 active MDS servers.
>> 
>> It seems that we can’t get all the files out of it because of the affected 
>> PGs.
>> 
>> The object store has more than 400 million objects.
>> 
>> When we do “rados -p cephfs-data ls”, the listing stops (hangs) after 
>> listing +11 million objects.
>> 
>> When we try to access an object which we can’t copy, the rados command hangs 
>> forever:
>> 
>> ls -i <filename>
>> 2199140525188
>> 
>> printf "%x\n" 2199140525188
>> 20006fd6484
>> 
>> rados -p cephfs-data stat 20006fd6484.00000000 (hangs here)
>> 
>> This is the current status of the ceph cluster:
>>   health: HEALTH_WARN
>>           1 MDSs report slow metadata IOs
>>           1 MDSs report slow requests
>>           1 MDSs behind on trimming
>>           1 osds exist in the crush map but not in the osdmap
>>           *Reduced data availability: 22 pgs inactive, 22 pgs incomplete*
>>           240324 slow ops, oldest one blocked for 391503 sec, daemons
>> [osd.144,osd.159,osd.180,osd.184,osd.242,osd.271,osd.275,osd.278,osd.280,osd.332]...
>> have slow ops.
>>           mons xyz01,xyz02 are low on available space
>> 
>> services:
>>   mon: 4 daemons, quorum abc001,abc002,xyz02,xyz01
>>   mgr: abc002(active), standbys: xyz01, xyz02, abc001
>>   mds: cephfs-3/3/3 up  
>> {0=xyz02=up:active,1=abc001=up:active,2=abc002=up:active}, 1 up:standby
>>   osd: 421 osds: 421 up, 421 in; 7 remapped pgs
>> 
>> data:
>>   pools:   2 pools, 4096 pgs
>>   objects: 403.4 M objects, 846 TiB
>>   usage:   1.2 PiB used, 1.4 PiB / 2.6 PiB avail
>>   pgs:     0.537% pgs not active
>>            3968 active+clean
>>            96   active+clean+scrubbing+deep+repair
>>            15   incomplete
>>            10   active+clean+scrubbing
>>            7    remapped+incomplete
>> 
>> io:
>>   client:   89 KiB/s rd, 13 KiB/s wr, 34 op/s rd, 1 op/s wr
>> 
>> The 100+ PGs have been in this state for a long time already.
>> 
>> Sometimes when we try to copy some files the rsync process hangs and we 
>> can’t kill it and from the process stack, it seems to be hanging on ceph i/o 
>> operation.
>> 
>> # cat /proc/51795/stack
>> [<ffffffffc184206d>] ceph_mdsc_do_request+0xfd/0x280 [ceph]
>> [<ffffffffc181e92e>] __ceph_do_getattr+0x9e/0x200 [ceph]
>> [<ffffffffc181eb08>] ceph_getattr+0x28/0x100 [ceph]
>> [<ffffffffab853689>] vfs_getattr+0x49/0x80
>> [<ffffffffab8537b5>] vfs_fstatat+0x75/0xc0
>> [<ffffffffab853bc1>] SYSC_newlstat+0x31/0x60
>> [<ffffffffab85402e>] SyS_newlstat+0xe/0x10
>> [<ffffffffabd93f92>] system_call_fastpath+0x25/0x2a
>> [<ffffffffffffffff>] 0xffffffffffffffff
>> 
>> # cat /proc/51795/mem
>> cat: /proc/51795/mem: Input/output error
>> 
>> Any idea on how to move forward with debugging and fixing this issue so we 
>> can get the data out of the ceph FS?
>> 
>> Thank you in advance.
>> 
>> Kind regards,
>> adel
>> 
> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
