Note: I am not entirely sure here, and would love other input from the ML about this, so take this with a grain of salt.
You don't show any unfound objects, which I think is excellent news as far as data loss goes.

>> 96   active+clean+scrubbing+deep+repair

The deep scrub + repair seems auspicious, though it also looks like a really heavy operation on those PGs.

I can't tell for sure, but it looks like your EC profile is K+M=12, which could be 10+2, 9+3, or hopefully not 11+1.

That said, being on Mimic, I am thinking that you are more than likely running into this:
https://docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coded-pool-recovery

> Prior to Octopus, erasure coded pools required at least min_size shards to be
> available, even if min_size is greater than K. (We generally recommend
> min_size be K+2 or more to prevent loss of writes and data.) This
> conservative decision was made out of an abundance of caution when designing
> the new pool mode but also meant pools with lost OSDs but no data loss were
> unable to recover and go active without manual intervention to change the
> min_size.

I can't definitively say whether reducing the min_size will unlock the offline data, but I think it could. As for what that value should be, I would just drop it by one and see if the PGs come out of their incomplete state. After (hopeful) recovery, I would revert min_size to its original value for safety.

Something odd I did notice in the pastebin of ceph health detail:

> pg 3.e5 is remapped+incomplete, acting
> [2147483647,2147483647,2147483647,2147483647,2147483647,278,2147483647,2147483647,273,2147483647,2147483647,2147483647]
> pg 3.14e is remapped+incomplete, acting
> [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,271,2147483647,222,416,2147483647]
> pg 3.45e is remapped+incomplete, acting
> [2147483647,2147483647,2147483647,2147483647,2147483647,377,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> pg 3.4bc is remapped+incomplete, acting
> [2147483647,280,2147483647,2147483647,2147483647,407,445,268,2147483647,2147483647,418,273]
> pg 3.7c6 is remapped+incomplete, acting
> [2147483647,338,2147483647,2147483647,261,2147483647,2147483647,2147483647,416,415,337,2147483647]
> pg 3.8e8 is remapped+incomplete, acting
> [2147483647,2147483647,2147483647,2147483647,360,418,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> pg 3.b5e is remapped+incomplete, acting
> [2147483647,242,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,399,2147483647,2147483647]

These 7 PGs are reporting a really large percentage of chunks with no OSD found (2147483647 is the placeholder printed when no OSD is mapped for a shard):
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#erasure-coded-pgs-are-not-active-clean

I think this could relate to the bit below about osd.73 throwing off the crush map. Someone with more experience may have a better understanding of what this implies.

As for osd.73, I would remove it from the crush map. An entry that exists in the crush map without being a valid OSD may be throwing off the crush mappings. The first step I would take would be:

$ ceph osd crush remove osd.73
$ ceph osd rm osd.73

This should reweight the ceph003 host and cause some data movement.

So, in summation, I would kill off osd.73 first.
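If it were me, the rough shape of the whole sequence would look something like the below. This is only a sketch, not something I have run against your cluster: the pool name cephfs-data and the current min_size of 11 come from your ceph osd pool ls detail output, and the profile name placeholder is whatever the first command reports for your pool.

1. Confirm the EC profile, so you know K and how low min_size can safely go:
$ ceph osd pool get cephfs-data erasure_code_profile
$ ceph osd erasure-code-profile get <profile-name-reported-above>

2. Remove the stale osd.73 from both the crush map and the osd map:
$ ceph osd crush remove osd.73
$ ceph osd rm osd.73

3. Let the resulting rebalance settle, keeping an eye on the incomplete PGs:
$ ceph -s
$ ceph pg ls incomplete

4. Drop min_size by one (11 -> 10) and see if the incomplete PGs go active:
$ ceph osd pool set cephfs-data min_size 10

5. Once the PGs have recovered, put min_size back where it was:
$ ceph osd pool set cephfs-data min_size 11

The profile check matters because if it turns out to be 11+1, a min_size of 10 would be below K, and you can't go below K: at least K chunks are needed to reconstruct any data at all.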
Then, after some assumed rebalancing, I would reduce the min_size to try to bring the PGs out of their incomplete state.

As I said, I'm not entirely sure, and would love a second opinion from someone, but if it were me in a vacuum, I think these would be my steps.

Reed

> On Jun 15, 2021, at 10:14 AM, Aly, Adel <adel....@atos.net> wrote:
>
> Hi Reed,
>
> Thank you for getting back to us.
>
> We had indeed several disk failures at the same time.
>
> Regarding the OSD map, we have an OSD that failed and needed to be removed,
> but we didn't update the crush map.
>
> The question here: is it safe to update the OSD crush map without affecting
> the available data?
>
> We can free up more space on the monitors if that will indeed help.
>
> More information which may be helpful:
>
> # ceph -v
> ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
>
> # ceph health detail
> https://pastebin.pl/view/2b8b337d
>
> # ceph osd pool ls detail
> pool 3 'cephfs-data' erasure size 12 min_size 11 crush_rule 1 object_hash
> rjenkins pg_num 3072 pgp_num 3072 last_change 370219 lfor 0/367599 flags
> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 40960 fast_read 1
> compression_algorithm snappy compression_mode force application cephfs
> removed_snaps [2~7c]
> pool 4 'cephfs-meta' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 1024 pgp_num 1024 last_change 370219 lfor 0/367414 flags
> hashpspool stripe_width 0 compression_algorithm none compression_mode none
> application cephfs
>
> # ceph osd tree
> https://pastebin.pl/view/eac56017
>
> Our main struggle is that when we try to rsync data, the rsync process hangs
> because it encounters an inaccessible object.
>
> Is there a way we can take the incomplete PGs out, so that we can copy data
> smoothly without having to restart the rsync process?
>
> Kind regards,
> adel
>
> -----Original Message-----
> From: Reed Dier <reed.d...@focusvq.com>
> Sent: Tuesday, June 15, 2021 4:21 PM
> To: Aly, Adel <adel....@atos.net>
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] ceph PGs issues
>
> You have incomplete PGs, which means you have inactive data, because the data
> isn't there.
>
> This will typically only happen when you have multiple concurrent disk
> failures, or something like that, so I think there is some missing info.
>
>> 1 osds exist in the crush map but not in the osdmap
>
> This seems like a red flag: an OSD in the crush map but not in the osdmap.
>
> >> mons xyz01,xyz02 are low on available space
>
> Your mons are probably filling up with data from running in the warn state.
> This can be problematic for recovery.
>
> I think you will be more likely to receive some useful suggestions by
> providing things like which version of ceph you are using ($ ceph -v), major
> events that caused this, pool ($ ceph osd pool ls detail) and osd ($ ceph osd
> tree) topology, as well as the detailed health output ($ ceph health detail).
>
> Given how large some of this may be, like the osd tree, you may want to paste
> it to pastebin and link it here.
>
> Reed
>
>> On Jun 15, 2021, at 2:48 AM, Aly, Adel <adel....@atos.net> wrote:
>>
>> Dears,
>>
>> We have a ceph cluster with 4096 PGs, out of which 100+ PGs are not
>> active+clean.
>>
>> On top of the ceph cluster, we have a CephFS with 3 active MDS servers.
>>
>> It seems that we can't get all the files out of it because of the affected
>> PGs.
>>
>> The object store has more than 400 million objects.
>>
>> When we do "rados -p cephfs-data ls", the listing stops (hangs) after
>> listing 11+ million objects.
>>
>> When we try to access an object which we can't copy, the rados command hangs
>> forever:
>>
>> ls -i <filename>
>> 2199140525188
>>
>> printf "%x\n" 2199140525188
>> 20006fd6484
>>
>> rados -p cephfs-data stat 20006fd6484.00000000    (hangs here)
>>
>> This is the current status of the ceph cluster:
>>
>>   health: HEALTH_WARN
>>           1 MDSs report slow metadata IOs
>>           1 MDSs report slow requests
>>           1 MDSs behind on trimming
>>           1 osds exist in the crush map but not in the osdmap
>>           *Reduced data availability: 22 pgs inactive, 22 pgs incomplete*
>>           240324 slow ops, oldest one blocked for 391503 sec, daemons
>>           [osd.144,osd.159,osd.180,osd.184,osd.242,osd.271,osd.275,osd.278,osd.280,osd.332]... have slow ops.
>>           mons xyz01,xyz02 are low on available space
>>
>>   services:
>>     mon: 4 daemons, quorum abc001,abc002,xyz02,xyz01
>>     mgr: abc002(active), standbys: xyz01, xyz02, abc001
>>     mds: cephfs-3/3/3 up
>>          {0=xyz02=up:active,1=abc001=up:active,2=abc002=up:active}, 1 up:standby
>>     osd: 421 osds: 421 up, 421 in; 7 remapped pgs
>>
>>   data:
>>     pools:   2 pools, 4096 pgs
>>     objects: 403.4 M objects, 846 TiB
>>     usage:   1.2 PiB used, 1.4 PiB / 2.6 PiB avail
>>     pgs:     0.537% pgs not active
>>              3968 active+clean
>>              96   active+clean+scrubbing+deep+repair
>>              15   incomplete
>>              10   active+clean+scrubbing
>>              7    remapped+incomplete
>>
>>   io:
>>     client: 89 KiB/s rd, 13 KiB/s wr, 34 op/s rd, 1 op/s wr
>>
>> The 100+ PGs have been in this state for a long time already.
>>
>> Sometimes when we try to copy some files, the rsync process hangs, we can't
>> kill it, and from the process stack it seems to be hanging on a ceph I/O
>> operation.
>>
>> # cat /proc/51795/stack
>> [<ffffffffc184206d>] ceph_mdsc_do_request+0xfd/0x280 [ceph]
>> [<ffffffffc181e92e>] __ceph_do_getattr+0x9e/0x200 [ceph]
>> [<ffffffffc181eb08>] ceph_getattr+0x28/0x100 [ceph]
>> [<ffffffffab853689>] vfs_getattr+0x49/0x80
>> [<ffffffffab8537b5>] vfs_fstatat+0x75/0xc0
>> [<ffffffffab853bc1>] SYSC_newlstat+0x31/0x60
>> [<ffffffffab85402e>] SyS_newlstat+0xe/0x10
>> [<ffffffffabd93f92>] system_call_fastpath+0x25/0x2a
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> # cat /proc/51795/mem
>> cat: /proc/51795/mem: Input/output error
>>
>> Any idea how to move forward with debugging and fixing this issue, so we can
>> get the data out of CephFS?
>>
>> Thank you in advance.
>>
>> Kind regards,
>> adel
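One more thought on the rsync hangs above, building on the ls -i / printf trick already shown in the thread: ceph osd map will tell you which PG a given object hashes to, and comparing that against the list of incomplete PGs is one way to decide which files to skip (or at least bound the wait on) rather than letting rsync hang. This is only a sketch using the example object from the thread, and note that a large file is striped over many <inode-hex>.<n> objects, so checking just the first object is not definitive.

Which PG does the example object map to, and is that PG one of the incomplete ones?
$ ceph osd map cephfs-data 20006fd6484.00000000
$ ceph pg ls incomplete

When probing an object, bound the wait instead of hanging forever:
$ timeout 10 rados -p cephfs-data stat 20006fd6484.00000000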
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io