Hi, thank you for the response.
The details of my pool are below:
pool 2 'volumes' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 627 flags hashpspool stripe_width 0
removed_snaps [1~3]
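
For what it's worth, my understanding is that to survive two simultaneous OSD
failures without losing data, the pool would need a higher replica count, for
example (the values 3 and 2 below are only an example, not what this pool
currently uses):

# ceph osd pool set volumes size 3
# ceph osd pool set volumes min_size 2
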
My test case was about a disaster scenario. I think the situation in which
all copies of the data are lost can occur in production (in my test, I
deleted all copies of the data myself to simulate a disaster state).
When all copies of the data are gone, the Ceph cluster never gets back to a
clean state. How can I recover in this situation?
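
As a last resort, if the data in those PGs is truly unrecoverable, I am
wondering whether the only way to clear the stale PGs is to delete and
recreate the pool, for example:

# ceph osd pool delete volumes volumes --yes-i-really-really-mean-it
# ceph osd pool create volumes 128 128 replicated

but that destroys whatever is left in the pool, so I would prefer a way that
keeps the pool intact.
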
Thank you.
2017-08-18 21:28 GMT+09:00 David Turner <[email protected]>:
> What were the settings for your pool? What was the size? It looks like
> the size was 2 and that the PGs only existed on osds 2 and 6. If that's the
> case, it's like having a 4-disk RAID 1+0, removing 2 disks of the same
> mirror, and complaining that the other mirror didn't pick up the data...
> Don't delete all copies of your data. If your replica size is 2, you
> cannot lose 2 disks at the same time.
>
> On Fri, Aug 18, 2017, 1:28 AM Hyun Ha <[email protected]> wrote:
>
>> Hi, Cephers!
>>
>> I'm currently testing a double-failure scenario on a Ceph cluster, but I
>> found that the affected PGs stay in the stale state forever.
>>
>> Reproduce steps (the rough commands for steps 1-5 are sketched after the ceph status output below):
>> 0. ceph version : jewel 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> 1. Pool create : exp-volumes (size = 2, min_size = 1)
>> 2. rbd create : testvol01
>> 3. rbd map and mkfs.xfs
>> 4. mount and create a file
>> 5. list the rados objects
>> 6. check the osd map of each object
>> # ceph osd map exp-volumes rbd_data.4a41f238e1f29.000000000000017a
>> osdmap e199 pool 'exp-volumes' (2) object 'rbd_data.4a41f238e1f29.000000000000017a' -> pg 2.3f04d6e2 (2.62) -> up ([2,6], p2) acting ([2,6], p2)
>> 7. stop the primary osd.2 and the secondary osd.6 of the above object at the same time
>> 8. check ceph status
>> health HEALTH_ERR
>> 16 pgs are stuck inactive for more than 300 seconds
>> 16 pgs stale
>> 16 pgs stuck stale
>> monmap e11: 3 mons at {10.105.176.85=10.105.176.85:6789/0,10.110.248.154=10.110.248.154:6789/0,10.110.249.153=10.110.249.153:6789/0}
>> election epoch 84, quorum 0,1,2 10.105.176.85,10.110.248.154,10.110.249.153
>> osdmap e248: 6 osds: 4 up, 4 in; 16 remapped pgs
>> flags sortbitwise,require_jewel_osds
>> pgmap v112095: 128 pgs, 1 pools, 14659 kB data, 17 objects
>> 165 MB used, 159 GB / 160 GB avail
>> 112 active+clean
>> 16 stale+active+clean
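>>
>> For reference, the commands behind steps 1 to 5 were roughly the following
>> (the image size, device name, and mount point are illustrative, not the
>> exact values I used):
>>
>> # ceph osd pool create exp-volumes 128 128 replicated
>> # ceph osd pool set exp-volumes size 2
>> # ceph osd pool set exp-volumes min_size 1
>> # rbd create exp-volumes/testvol01 --size 1024
>> # rbd map exp-volumes/testvol01
>> # mkfs.xfs /dev/rbd0
>> # mkdir -p /mnt/testvol01 && mount /dev/rbd0 /mnt/testvol01
>> # dd if=/dev/zero of=/mnt/testvol01/testfile bs=1M count=10
>> # rados ls -p exp-volumes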
>>
>> # ceph health detail
>> HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 16 pgs
>> stale; 16 pgs stuck stale
>> pg 2.67 is stuck stale for 689.171742, current state stale+active+clean,
>> last acting [2,6]
>> pg 2.5a is stuck stale for 689.171748, current state stale+active+clean,
>> last acting [6,2]
>> pg 2.52 is stuck stale for 689.171753, current state stale+active+clean,
>> last acting [2,6]
>> pg 2.4d is stuck stale for 689.171757, current state stale+active+clean,
>> last acting [2,6]
>> pg 2.56 is stuck stale for 689.171755, current state stale+active+clean,
>> last acting [6,2]
>> pg 2.d is stuck stale for 689.171811, current state stale+active+clean,
>> last acting [6,2]
>> pg 2.79 is stuck stale for 689.171808, current state stale+active+clean,
>> last acting [2,6]
>> pg 2.1f is stuck stale for 689.171782, current state stale+active+clean,
>> last acting [6,2]
>> pg 2.76 is stuck stale for 689.171809, current state stale+active+clean,
>> last acting [6,2]
>> pg 2.17 is stuck stale for 689.171794, current state stale+active+clean,
>> last acting [6,2]
>> pg 2.63 is stuck stale for 689.171794, current state stale+active+clean,
>> last acting [2,6]
>> pg 2.77 is stuck stale for 689.171816, current state stale+active+clean,
>> last acting [2,6]
>> pg 2.1b is stuck stale for 689.171793, current state stale+active+clean,
>> last acting [6,2]
>> pg 2.62 is stuck stale for 689.171765, current state stale+active+clean,
>> last acting [2,6]
>> pg 2.30 is stuck stale for 689.171799, current state stale+active+clean,
>> last acting [2,6]
>> pg 2.19 is stuck stale for 689.171798, current state stale+active+clean,
>> last acting [6,2]
>>
>> # ceph pg dump_stuck stale
>> ok
>> pg_stat state up up_primary acting acting_primary
>> 2.67 stale+active+clean [2,6] 2 [2,6] 2
>> 2.5a stale+active+clean [6,2] 6 [6,2] 6
>> 2.52 stale+active+clean [2,6] 2 [2,6] 2
>> 2.4d stale+active+clean [2,6] 2 [2,6] 2
>> 2.56 stale+active+clean [6,2] 6 [6,2] 6
>> 2.d stale+active+clean [6,2] 6 [6,2] 6
>> 2.79 stale+active+clean [2,6] 2 [2,6] 2
>> 2.1f stale+active+clean [6,2] 6 [6,2] 6
>> 2.76 stale+active+clean [6,2] 6 [6,2] 6
>> 2.17 stale+active+clean [6,2] 6 [6,2] 6
>> 2.63 stale+active+clean [2,6] 2 [2,6] 2
>> 2.77 stale+active+clean [2,6] 2 [2,6] 2
>> 2.1b stale+active+clean [6,2] 6 [6,2] 6
>> 2.62 stale+active+clean [2,6] 2 [2,6] 2
>> 2.30 stale+active+clean [2,6] 2 [2,6] 2
>> 2.19 stale+active+clean [6,2] 6 [6,2] 6
>>
>> # ceph pg 2.62 query
>> Error ENOENT: i don't have pgid 2.62
>>
>> # rados ls -p exp-volumes
>> rbd_data.4a41f238e1f29.000000000000003f
>> ^C --> hang
>>
>> I understand that this is a natural result because the above PGs no longer
>> have a primary or a secondary OSD. But this situation could occur in
>> production, so I want to recover the Ceph cluster and the RBD images.
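>>
>> For what it's worth, the state can be cross-checked with commands like the
>> following, to confirm that osd.2 and osd.6 are down and to see what the
>> monitor currently reports as the mapping for one of the stale PGs:
>>
>> # ceph osd tree
>> # ceph pg map 2.62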
>>
>> First, I want to know how to bring the cluster's state back to clean.
>> I read the documentation and tried to solve this, but nothing helped,
>> including the commands below.
>> - ceph pg force_create_pg 2.6
>> - ceph osd lost 2 --yes-i-really-mean-it
>> - ceph osd lost 6 --yes-i-really-mean-it
>> - ceph osd crush rm osd.2
>> - ceph osd crush rm osd.6
>> - ceph osd rm osd.2
>> - ceph osd rm osd.6
>>
>> Is there any command to force-delete these PGs or otherwise make the cluster clean?
>> Thank you in advance.
>>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com