Re: [ceph-users] pg is stuck stale (osd.21 still removed)

2016-01-13 Thread Daniel Schwager
Hi ceph-users,

any idea how to fix my cluster? OSD.21 was removed, but some (stale) PGs are still pointing to OSD.21...

I don't know how to proceed... Help is very welcome!

Best regards
Daniel


> -----Original Message-----
> From: Daniel Schwager
> Sent: Friday, January 08, 2016 3:10 PM
> To: 'ceph-us...@ceph.com'
> Subject: pg is stuck stale (osd.21 still removed)
> 
> Hi,
> 
> we had a hardware problem with OSD.21 today. The OSD daemon was down and
> "smartctl" reported hardware errors.
> 
> I decided to remove the HDD:
> 
>   ceph osd out 21
>   ceph osd crush remove osd.21
>   ceph auth del osd.21
>   ceph osd rm osd.21
> 
> But afterwards I saw that I still have some stuck PGs for osd.21:
> 
>   root@ceph-admin:~# ceph -w
>       cluster c7b12656-15a6-41b0-963f-4f47c62497dc
>        health HEALTH_WARN
>               50 pgs stale
>               50 pgs stuck stale
>        monmap e4: 3 mons at {ceph-mon1=192.168.135.31:6789/0,ceph-mon2=192.168.135.32:6789/0,ceph-mon3=192.168.135.33:6789/0}
>               election epoch 404, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
>        mdsmap e136: 1/1/1 up {0=ceph-mon1=up:active}
>        osdmap e18259: 23 osds: 23 up, 23 in
>         pgmap v47879105: 6656 pgs, 10 pools, 23481 GB data, 6072 kobjects
>               54974 GB used, 30596 GB / 85571 GB avail
>                   6605 active+clean
>                     50 stale+active+clean
>                      1 active+clean+scrubbing+deep
> 
>   root@ceph-admin:~# ceph health
>   HEALTH_WARN 50 pgs stale; 50 pgs stuck stale
> 
>   root@ceph-admin:~# ceph health detail
>   HEALTH_WARN 50 pgs stale; 50 pgs stuck stale; noout flag(s) set
>   pg 34.225 is stuck stale for 98780.399254, current state stale+active+clean, last acting [21]
>   pg 34.186 is stuck stale for 98780.399195, current state stale+active+clean, last acting [21]
>   ...
> 
>   root@ceph-admin:~# ceph pg 34.225   query
>   Error ENOENT: i don't have pgid 34.225
> 
>   root@ceph-admin:~# ceph pg 34.225  list_missing
>   Error ENOENT: i don't have pgid 34.225
> 
>   root@ceph-admin:~# ceph osd lost 21  --yes-i-really-mean-it
>   osd.21 is not down or doesn't exist
> 
>   # checking the crushmap
>   ceph osd getcrushmap -o crush.map
>   crushtool -d crush.map  -o crush.txt
>   root@ceph-admin:~# grep 21 crush.txt
>   -> nothing here
> 
> 
> Of course, I cannot start OSD.21, because it's not available anymore - I 
> removed it.
> 
> Is there a way to remap the stuck PGs to OSDs other than osd.21?
> 
> One more thing - I tried to recreate the pg, but now this pg is "stuck inactive":
> 
>   root@ceph-admin:~# ceph pg force_create_pg 34.225
>   pg 34.225 now creating, ok
> 
>   root@ceph-admin:~# ceph health detail
>   HEALTH_WARN 49 pgs stale; 1 pgs stuck inactive; 49 pgs stuck stale; 1 pgs stuck unclean
>   pg 34.225 is stuck inactive since forever, current state creating, last acting []
>   pg 34.225 is stuck unclean since forever, current state creating, last acting []
>   pg 34.186 is stuck stale for 118481.013632, current state stale+active+clean, last acting [21]
>   ...
> 
> Maybe somebody has an idea how to fix this situation?




Re: [ceph-users] pg is stuck stale (osd.21 still removed)

2016-01-13 Thread Alex Gorbachev
Hi Daniel,

On Friday, January 8, 2016, Daniel Schwager wrote:

> One more thing - I tried to recreate the pg, but now this pg is "stuck
> inactive":
>
> root@ceph-admin:~# ceph pg force_create_pg 34.225
> pg 34.225 now creating, ok
>
> root@ceph-admin:~# ceph health detail
> HEALTH_WARN 49 pgs stale; 1 pgs stuck inactive; 49 pgs stuck stale; 1 pgs stuck unclean
> pg 34.225 is stuck inactive since forever, current state creating, last acting []
> pg 34.225 is stuck unclean since forever, current state creating, last acting []
> pg 34.186 is stuck stale for 118481.013632, current state stale+active+clean, last acting [21]
> ...
>
> Maybe somebody has an idea how to fix this situation?


Unfortunately I don't have the answers, but maybe the following links will
help you make some progress:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17820.html

https://ceph.com/community/incomplete-pgs-oh-my/

Good luck,
Alex



--
Alex Gorbachev
Storcium


Re: [ceph-users] pg is stuck stale (osd.21 still removed) - SOLVED.

2016-01-13 Thread Daniel Schwager
Well, ok - I found the solution:

ceph health detail
HEALTH_WARN 50 pgs stale; 50 pgs stuck stale
pg 34.225 is stuck inactive since forever, current state creating, last acting []
pg 34.225 is stuck unclean since forever, current state creating, last acting []
pg 34.226 is stuck stale for 77328.923060, current state stale+active+clean, last acting [21]
pg 34.3cb is stuck stale for 77328.923213, current state stale+active+clean, last acting [21]


root@ceph-admin:~# ceph pg map 34.225
osdmap e18263 pg 34.225 (34.225) -> up [16] acting [16]

After restarting osd.16, pg 34.225 is fine.
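
(How exactly the restart is done depends on the init system - for reference, on a hammer-era cluster one of these forms should apply, with osd.16 as the example from above:)

    service ceph restart osd.16        # sysvinit
    restart ceph-osd id=16             # upstart, e.g. Ubuntu 14.04
    systemctl restart ceph-osd@16      # systemd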

So I recreated all the broken PGs:

    for pg in `ceph health detail | grep stale | cut -d' ' -f2`; do
        ceph pg force_create_pg $pg
    done

and restarted all (or just the necessary) OSDs.
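
(To find just the "necessary" OSDs, a minimal sketch - it assumes the `ceph pg map` output format shown above and only prints what it would restart:)

    for pg in `ceph health detail | grep stale | cut -d' ' -f2`; do
        # "osdmap eN pg X (X) -> up [A] acting [B]" -> field 4 is the acting set
        osd=`ceph pg map $pg | awk -F'[][]' '{print $4}'`
        echo "pg $pg -> restart osd.$osd"
    done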

Now, the cluster is HEALTH_OK again.
root@ceph-admin:~# ceph  health
HEALTH_OK

Best regards
Danny




Re: [ceph-users] pg is stuck stale (osd.21 still removed)

2016-01-08 Thread Daniel Schwager
One more thing - I tried to recreate the pg, but now this pg is "stuck inactive":

root@ceph-admin:~# ceph pg force_create_pg 34.225
pg 34.225 now creating, ok

root@ceph-admin:~# ceph health detail
HEALTH_WARN 49 pgs stale; 1 pgs stuck inactive; 49 pgs stuck stale; 1 pgs stuck unclean
pg 34.225 is stuck inactive since forever, current state creating, last acting []
pg 34.225 is stuck unclean since forever, current state creating, last acting []
pg 34.186 is stuck stale for 118481.013632, current state stale+active+clean, last acting [21]
...
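
(One way to check whether CRUSH can map the recreated pg at all would be to test it against an export of the current osdmap, e.g.:)

    ceph osd getmap -o osd.map
    osdmaptool osd.map --test-map-pg 34.225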

Maybe somebody has an idea how to fix this situation?

regards
Danny






[ceph-users] pg is stuck stale (osd.21 still removed)

2016-01-08 Thread Daniel Schwager
Hi,

we had a hardware problem with OSD.21 today. The OSD daemon was down and
"smartctl" reported hardware errors.

I decided to remove the HDD:

  ceph osd out 21
  ceph osd crush remove osd.21
  ceph auth del osd.21
  ceph osd rm osd.21
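
(A quick sanity check that the OSD is really gone from the CRUSH map - standard CLI, the grep should come back empty:)

    ceph osd tree | grep osd.21    # no output expected after the removal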

But afterwards I saw that I still have some stuck PGs for osd.21:

root@ceph-admin:~# ceph -w
    cluster c7b12656-15a6-41b0-963f-4f47c62497dc
     health HEALTH_WARN
            50 pgs stale
            50 pgs stuck stale
     monmap e4: 3 mons at {ceph-mon1=192.168.135.31:6789/0,ceph-mon2=192.168.135.32:6789/0,ceph-mon3=192.168.135.33:6789/0}
            election epoch 404, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
     mdsmap e136: 1/1/1 up {0=ceph-mon1=up:active}
     osdmap e18259: 23 osds: 23 up, 23 in
      pgmap v47879105: 6656 pgs, 10 pools, 23481 GB data, 6072 kobjects
            54974 GB used, 30596 GB / 85571 GB avail
                6605 active+clean
                  50 stale+active+clean
                   1 active+clean+scrubbing+deep

root@ceph-admin:~# ceph health
HEALTH_WARN 50 pgs stale; 50 pgs stuck stale

root@ceph-admin:~# ceph health detail
HEALTH_WARN 50 pgs stale; 50 pgs stuck stale; noout flag(s) set
pg 34.225 is stuck stale for 98780.399254, current state stale+active+clean, last acting [21]
pg 34.186 is stuck stale for 98780.399195, current state stale+active+clean, last acting [21]
...

root@ceph-admin:~# ceph pg 34.225   query
Error ENOENT: i don't have pgid 34.225

root@ceph-admin:~# ceph pg 34.225  list_missing
Error ENOENT: i don't have pgid 34.225
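
(pg query and list_missing fail here because no OSD currently hosts the pg; the affected PGs can still be listed from the mon side, e.g.:)

    ceph pg dump_stuck stale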

root@ceph-admin:~# ceph osd lost 21  --yes-i-really-mean-it
osd.21 is not down or doesn't exist

# checking the crushmap
  ceph osd getcrushmap -o crush.map
  crushtool -d crush.map  -o crush.txt
root@ceph-admin:~# grep 21 crush.txt
-> nothing here
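
(The osdmap can be checked the same way - after "ceph osd rm" the id should be gone there too:)

    ceph osd dump | grep osd.21    # expected: no output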


Of course, I cannot start OSD.21, because it's not available anymore - I 
removed it.

Is there a way to remap the stuck PGs to OSDs other than osd.21? How can I
help my cluster (ceph 0.94.2)?
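
(One way to see where CRUSH would place such a pg now is to ask the monitor directly, e.g.:)

    ceph pg map 34.225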

best regards
Danny

