Re: [ceph-users] [need your help] How to Fix unclean PG
This looks fine and will recover on its own. If you are not seeing enough client IO, then your tuning of recovery IO vs. client IO priority is off. A simple and effective knob is increasing the osd_recovery_sleep_hdd option (I think the default is 0.05 in Luminous and 0.1 since Mimic?), which throttles recovery speed.

Paul

2018-09-15 17:31 GMT+02:00 Frank Yu:
> Hi Paul,
>
> Before the upgrade there were 17 OSD servers (8 OSDs per server), 3 mds/rgw
> nodes, and 2 active MDS. I then added 5 OSD servers (16 OSDs per server);
> one active server crashed (and I rebooted it), and the MDS couldn't come
> back to health anymore, so I added two new MDS servers and deleted one of
> the original MDS servers. First I set the new OSDs' crush weight to 1
> (6 TB per OSD) and the cluster started rebalancing; before the rebalance
> finished, I changed the weight to 5.45798.
>
> [...]
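The throttle Paul mentions can be adjusted at runtime; a minimal sketch, assuming the value 0.1 as an example (injectargs changes do not survive an OSD restart, so persist the value in ceph.conf, or via `ceph config set` on Mimic and later):

```shell
# Raise the per-op recovery sleep on HDD OSDs to slow recovery
# and leave more IO headroom for clients (0.1 is an example value).
ceph tell 'osd.*' injectargs '--osd_recovery_sleep_hdd 0.1'

# Check the running value on one OSD (osd.0 is an example id),
# via its local admin socket:
ceph daemon osd.0 config get osd_recovery_sleep_hdd
```

Lowering the value again speeds recovery back up once client load drops.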
Re: [ceph-users] [need your help] How to Fix unclean PG
Hi Paul,

Before the upgrade there were 17 OSD servers (8 OSDs per server), 3 mds/rgw nodes, and 2 active MDS. I then added 5 OSD servers (16 OSDs per server); one active server crashed (and I rebooted it), and the MDS couldn't come back to health anymore, so I added two new MDS servers and deleted one of the original MDS servers. First I set the new OSDs' crush weight to 1 (6 TB per OSD) and the cluster started rebalancing; before the rebalance finished, I changed the weight to 5.45798.

More info below:

# ceph -s
  cluster:
    id:     a00cc99c-f9f9-4dd9-9281-43cd12310e41
    health: HEALTH_WARN
            28750646/577747527 objects misplaced (4.976%)
            Degraded data redundancy: 2724676/577747527 objects degraded (0.472%), 1476 pgs unclean, 451 pgs degraded, 356 pgs undersized

  services:
    mon: 3 daemons, quorum ark0008,ark0009,ark0010
    mgr: ark0009(active), standbys: ark0010, ark0008, ark0008.hobot.cc
    mds: cephfs-2/2/1 up {0=ark0018.hobot.cc=up:active,1=ark0020.hobot.cc=up:active}, 2 up:standby
    osd: 213 osds: 213 up, 209 in; 1433 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   17 pools, 10324 pgs
    objects: 183M objects, 124 TB
    usage:   425 TB used, 479 TB / 904 TB avail
    pgs:     2724676/577747527 objects degraded (0.472%)
             28750646/577747527 objects misplaced (4.976%)
             8848 active+clean
             565  active+remapped+backfilling
             449  active+remapped+backfill_wait
             319  active+undersized+degraded+remapped+backfilling
             36   active+undersized+degraded+remapped+backfill_wait
             32   active+recovery_wait+degraded
             29   active+recovery_wait+degraded+remapped
             20   active+degraded+remapped+backfill_wait
             14   active+degraded+remapped+backfilling
             11   active+recovery_wait
             1    active+recovery_wait+undersized+degraded+remapped

  io:
    client:   2356 B/s rd, 9051 kB/s wr, 0 op/s rd, 185 op/s wr
    recovery: 459 MB/s, 709 objects/s

# ceph health detail
HEALTH_WARN 28736684/577747554 objects misplaced (4.974%); Degraded data redundancy: 2722451/577747554 objects degraded (0.471%), 1475 pgs unclean, 451 pgs degraded, 356 pgs undersized
pg 5.dee is stuck unclean for 93114.056729, current state active+remapped+backfilling, last acting [19,153,64]
pg 5.df4 is stuck undersized for 86028.395042, current state active+undersized+degraded+remapped+backfilling, last acting [81,83]
pg 5.df8 is stuck unclean for 10529.471700, current state active+remapped+backfilling, last acting [53,212,106]
pg 5.dfa is stuck unclean for 86193.279939, current state active+remapped+backfill_wait, last acting [58,122,98]
pg 5.dfd is stuck unclean for 21944.059088, current state active+remapped+backfilling, last acting [119,91,22]
pg 5.e01 is stuck undersized for 73773.177963, current state active+undersized+degraded+remapped+backfilling, last acting [88,116]
pg 5.e02 is stuck undersized for 10615.864226, current state active+undersized+degraded+remapped+backfilling, last acting [112,110]
pg 5.e04 is active+degraded+remapped+backfilling, acting [44,10,104]
pg 5.e07 is stuck undersized for 86060.059937, current state active+undersized+degraded+remapped+backfilling, last acting [100,65]
pg 5.e09 is stuck unclean for 86247.708352, current state active+remapped+backfilling, last acting [19,187,46]
pg 5.e0a is stuck unclean for 93073.574629, current state active+remapped+backfilling, last acting [92,13,118]
pg 5.e0b is stuck unclean for 86247.949138, current state active+remapped+backfilling, last acting [31,54,68]
pg 5.e10 is stuck unclean for 17390.342397, current state active+remapped+backfill_wait, last acting [71,202,119]
pg 5.e13 is stuck unclean for 93092.549049, current state active+remapped+backfilling, last acting [33,90,110]
pg 5.e16 is stuck unclean for 86250.883911, current state active+remapped+backfill_wait, last acting [79,108,56]
pg 5.e17 is stuck undersized for 15167.783137, current state active+undersized+degraded+remapped+backfill_wait, last acting [42,28]
pg 5.e18 is stuck unclean for 18122.375128, current state active+remapped+backfill_wait, last acting [26,43,31]
pg 5.e20 is stuck unclean for 86255.524287, current state active+remapped+backfilling, last acting [122,52,7]
pg 5.e27 is stuck unclean for 10706.283143, current state active+remapped+backfill_wait, last acting [56,104,73]
pg 5.e29 is stuck undersized for 86036.590643, current state active+undersized+degraded+remapped+backfilling, last acting [49,35]
pg 5.e2c is stuck unclean for 86257.751565, current state active+remapped+backfilling, last acting [70,106,91]
pg 5.e2e is stuck undersized for 10615.804510, current state active+undersized+degraded+remapped+backfilling, last acting [35,103]
pg 5.e32 is stuck undersized for 74758.649684, current state active+undersized+degraded+remapped+backfilling, last acting [39,53]
pg 5.e35 is stuck unclean
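The weight change Frank describes corresponds to something like the following (osd.200 is an example id, not one from this cluster; raising the weight in several smaller steps would have limited how much data was in motion at once):

```shell
# Bring a newly added OSD in at a low CRUSH weight first,
# so only a small amount of data is rebalanced onto it...
ceph osd crush reweight osd.200 1.0

# ...and later raise it to its capacity-based weight
# (CRUSH weight conventionally tracks capacity in TiB, ~5.458 for a 6 TB disk).
ceph osd crush reweight osd.200 5.45798
```

Changing the weight while the first rebalance is still running, as happened here, simply restarts data movement toward the new placement, which is why so many PGs went back into backfill at once.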
Re: [ceph-users] [need your help] How to Fix unclean PG
Well, that's not a lot of information to troubleshoot such a problem. Please post the output of the following commands:

* ceph -s
* ceph health detail
* ceph osd pool ls detail
* ceph osd tree
* ceph osd df tree
* ceph versions

And a description of what you did to upgrade it.

Paul

2018-09-15 15:46 GMT+02:00 Frank Yu:
> Hello there,
>
> I have a ceph cluster which increased from 400 TB to 900 TB recently; now
> the cluster is in an unhealthy state, with about 1700+ PGs in unclean
> status:
>
> # ceph pg dump_stuck unclean | wc
> ok
>    1696   10176  191648
>
> The cephfs can't work anymore; the read IO was no more than MB/s.
> Is there any way to fix the unclean PGs quickly?
>
> --
> Regards
> Frank Yu

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
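The outputs Paul asks for can be gathered into one file for posting; a minimal sketch (the report filename is arbitrary):

```shell
# Collect the requested diagnostics into a single report.
# $cmd is intentionally unquoted so multi-word subcommands
# like "health detail" split into separate arguments.
{
  for cmd in "-s" "health detail" "osd pool ls detail" \
             "osd tree" "osd df tree" "versions"; do
      echo "== ceph $cmd =="
      ceph $cmd
  done
} > ceph-report.txt
```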
[ceph-users] [need your help] How to Fix unclean PG
Hello there,

I have a ceph cluster which increased from 400 TB to 900 TB recently; now the cluster is in an unhealthy state, with about 1700+ PGs in unclean status:

# ceph pg dump_stuck unclean | wc
ok
   1696   10176  191648

The cephfs can't work anymore; the read IO was no more than MB/s.
Is there any way to fix the unclean PGs quickly?

--
Regards
Frank Yu