Re: [ceph-users] [need your help] How to Fix unclean PG

2018-09-17 Thread Paul Emmerich
This looks fine and will recover on its own.
If you are not seeing enough client IO, that means your tuning of
recovery IO vs. client IO priority is off.
A simple and effective fix is to increase the osd_recovery_sleep_hdd
option (I think the default is 0.05 in Luminous and 0.1 since Mimic?),
which throttles recovery speed.
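
For example, something along these lines (just a sketch; the 0.2 value is
illustrative, and the second form needs the Mimic+ config store):

# ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.2'
# ceph config set osd osd_recovery_sleep_hdd 0.2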

Paul


Re: [ceph-users] [need your help] How to Fix unclean PG

2018-09-15 Thread Frank Yu
Hi Paul,

Before the upgrade there were 17 OSD servers (8 OSDs per server), 3 mds/rgw
nodes, and 2 active MDS. Then I added 5 OSD servers (16 OSDs per server),
after which one of the active MDS servers crashed (I rebooted it), but the
MDS couldn't come back to health anymore. So I added two new MDS servers and
deleted one of the original MDS servers. First I set the crush weight of the
new OSDs to 1 (6 TB per OSD) and the cluster started rebalancing; before the
rebalance finished, I changed the weight to 5.45798.
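
(For reference, the crush weight changes described above correspond to
commands along these lines; osd.170 is just a placeholder id:)

# ceph osd crush reweight osd.170 1
# ceph osd crush reweight osd.170 5.45798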

More info below:


# ceph -s
  cluster:
id: a00cc99c-f9f9-4dd9-9281-43cd12310e41
health: HEALTH_WARN
28750646/577747527 objects misplaced (4.976%)
Degraded data redundancy: 2724676/577747527 objects degraded
(0.472%), 1476 pgs unclean, 451 pgs degraded, 356 pgs undersized

  services:
mon: 3 daemons, quorum ark0008,ark0009,ark0010
mgr: ark0009(active), standbys: ark0010, ark0008, ark0008.hobot.cc
mds: cephfs-2/2/1 up
{0=ark0018.hobot.cc=up:active,1=ark0020.hobot.cc=up:active},
2 up:standby
osd: 213 osds: 213 up, 209 in; 1433 remapped pgs
rgw: 1 daemon active

  data:
pools:   17 pools, 10324 pgs
objects: 183M objects, 124 TB
usage:   425 TB used, 479 TB / 904 TB avail
pgs: 2724676/577747527 objects degraded (0.472%)
 28750646/577747527 objects misplaced (4.976%)
 8848 active+clean
 565  active+remapped+backfilling
 449  active+remapped+backfill_wait
 319  active+undersized+degraded+remapped+backfilling
 36   active+undersized+degraded+remapped+backfill_wait
 32   active+recovery_wait+degraded
 29   active+recovery_wait+degraded+remapped
 20   active+degraded+remapped+backfill_wait
 14   active+degraded+remapped+backfilling
 11   active+recovery_wait
 1    active+recovery_wait+undersized+degraded+remapped

  io:
client:   2356 B/s rd, 9051 kB/s wr, 0 op/s rd, 185 op/s wr
recovery: 459 MB/s, 709 objects/s

# ceph health detail
HEALTH_WARN 28736684/577747554 objects misplaced (4.974%); Degraded data
redundancy: 2722451/577747554 objects degraded (0.471%), 1475 pgs unclean,
451 pgs degraded, 356 pgs undersized
pg 5.dee is stuck unclean for 93114.056729, current state
active+remapped+backfilling, last acting [19,153,64]
pg 5.df4 is stuck undersized for 86028.395042, current state
active+undersized+degraded+remapped+backfilling, last acting [81,83]
pg 5.df8 is stuck unclean for 10529.471700, current state
active+remapped+backfilling, last acting [53,212,106]
pg 5.dfa is stuck unclean for 86193.279939, current state
active+remapped+backfill_wait, last acting [58,122,98]
pg 5.dfd is stuck unclean for 21944.059088, current state
active+remapped+backfilling, last acting [119,91,22]
pg 5.e01 is stuck undersized for 73773.177963, current state
active+undersized+degraded+remapped+backfilling, last acting [88,116]
pg 5.e02 is stuck undersized for 10615.864226, current state
active+undersized+degraded+remapped+backfilling, last acting [112,110]
pg 5.e04 is active+degraded+remapped+backfilling, acting [44,10,104]
pg 5.e07 is stuck undersized for 86060.059937, current state
active+undersized+degraded+remapped+backfilling, last acting [100,65]
pg 5.e09 is stuck unclean for 86247.708352, current state
active+remapped+backfilling, last acting [19,187,46]
pg 5.e0a is stuck unclean for 93073.574629, current state
active+remapped+backfilling, last acting [92,13,118]
pg 5.e0b is stuck unclean for 86247.949138, current state
active+remapped+backfilling, last acting [31,54,68]
pg 5.e10 is stuck unclean for 17390.342397, current state
active+remapped+backfill_wait, last acting [71,202,119]
pg 5.e13 is stuck unclean for 93092.549049, current state
active+remapped+backfilling, last acting [33,90,110]
pg 5.e16 is stuck unclean for 86250.883911, current state
active+remapped+backfill_wait, last acting [79,108,56]
pg 5.e17 is stuck undersized for 15167.783137, current state
active+undersized+degraded+remapped+backfill_wait, last acting [42,28]
pg 5.e18 is stuck unclean for 18122.375128, current state
active+remapped+backfill_wait, last acting [26,43,31]
pg 5.e20 is stuck unclean for 86255.524287, current state
active+remapped+backfilling, last acting [122,52,7]
pg 5.e27 is stuck unclean for 10706.283143, current state
active+remapped+backfill_wait, last acting [56,104,73]
pg 5.e29 is stuck undersized for 86036.590643, current state
active+undersized+degraded+remapped+backfilling, last acting [49,35]
pg 5.e2c is stuck unclean for 86257.751565, current state
active+remapped+backfilling, last acting [70,106,91]
pg 5.e2e is stuck undersized for 10615.804510, current state
active+undersized+degraded+remapped+backfilling, last acting [35,103]
pg 5.e32 is stuck undersized for 74758.649684, current state
active+undersized+degraded+remapped+backfilling, last acting [39,53]
pg 5.e35 is stuck unclean 

Re: [ceph-users] [need your help] How to Fix unclean PG

2018-09-15 Thread Paul Emmerich
Well, that's not a lot of information to troubleshoot such a problem.

Please post the output of the following commands:

* ceph -s
* ceph health detail
* ceph osd pool ls detail
* ceph osd tree
* ceph osd df tree
* ceph versions

And a description of what you did to upgrade it.
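
(If it helps, something like this collects all of the above into a single
file to attach; the filename is arbitrary:)

# for c in "ceph -s" "ceph health detail" "ceph osd pool ls detail" "ceph osd tree" "ceph osd df tree" "ceph versions"; do echo "== $c =="; $c; done > ceph-report.txt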

Paul





-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


[ceph-users] [need your help] How to Fix unclean PG

2018-09-15 Thread Frank Yu
Hello there,

I have a Ceph cluster which was recently increased from 400 TB to 900 TB, and
now the cluster is in an unhealthy state; there are 1700+ PGs in unclean status.

# ceph pg dump_stuck unclean|wc
ok
   1696   10176  191648
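
(Dropping the |wc shows each stuck PG with its state and acting set, and a
single PG can then be queried for more detail; the PG id below is just a
placeholder:)

# ceph pg dump_stuck unclean
# ceph pg 1.23 query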

CephFS can't work anymore; the read IO is no more than MB/s.
Is there any way to fix the unclean PGs quickly?




-- 
Regards
Frank Yu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com