[ceph-users] Simulating Disk Failure

Craig Lewis Fri, 14 Jun 2013 17:59:18 -0700

So I'm trying to break my test cluster, and figure out how to put it back together again. I'm able to fix this, but the behavior seems strange to me, so I wanted to run it past more experienced people.

I'm doing these tests using RadosGW. I currently have 2 nodes, with replication=2. (I haven't gotten to the cluster expansion testing yet).

I'm going to upload a file, then simulate a disk failure by deleting some PGs on one of the OSDs. I have seen this mentioned as the way to fix OSDs that filled up during recovery/backfill. I expected the cluster to detect the error, change the cluster health to warn, then return the data from another copy. Instead, I got a 404 error.




me@client ~ $ s3cmd ls
2013-06-12 00:02  s3://bucket1

me@client ~ $ s3cmd ls s3://bucket1

2013-06-12 00:02        13   8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt


me@client ~ $ s3cmd put Object1 s3://bucket1
Object1 -> s3://bucket1/Object1  [1 of 1]
 400000000 of 400000000   100% in   62s     6.13 MB/s  done

me@client ~ $ s3cmd ls s3://bucket1

2013-06-13 01:10      381M   15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13   8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt

So at this point, the cluster is healthy, and we can download objects from RGW.



me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4055: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 17B/s rd, 0op/s
   mdsmap e1: 0/0/1 up

me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download1
s3://bucket1/Object1 -> ./Object.Download1 [1 of 1]
 400000000 of 400000000   100% in   13s    27.63 MB/s  done

Time to simulate a failure. Let's delete all the PGs used by .rgw.buckets on OSD.0.


me@dev-ceph0:~$ ceph osd tree

# id    weight     type name            up/down  reweight
-1      0.09998    root default
-2      0.04999        host dev-ceph0
0       0.04999            osd.0        up       1
-3      0.04999        host dev-ceph1
1       0.04999            osd.1        up       1


me@dev-ceph0:~$ ceph osd dump | grep .rgw.buckets

pool 9 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 21 owner 18446744073709551615
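
The pool ID (9 here) is what names the PG directories (9.0_head, 9.1_head, ...) on the OSD. As a minimal sketch, not part of the original session, the ID could be pulled out of the dump with an exact match on the pool name, since `grep .rgw.buckets` would also match any other pool whose name contains that string. The function name `pool_id_from_dump` is hypothetical, and the sketch assumes the `pool <id> '<name>' ...` line layout shown above.

```shell
# Hypothetical helper: read `ceph osd dump` output on stdin and print the
# numeric ID of the pool whose quoted name matches $1 exactly.
# Assumes lines of the form: pool <id> '<name>' rep size ...
pool_id_from_dump() {
  awk -v want="'$1'" '$1 == "pool" && $3 == want { print $2 }'
}

# Usage (assumes a reachable cluster):
#   ceph osd dump | pool_id_from_dump .rgw.buckets
```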


me@dev-ceph0:~$ cd /var/lib/ceph/osd/ceph-0/current
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ du -sh 9.*
321M    9.0_head
289M    9.1_head
425M    9.2_head
357M    9.3_head
358M    9.4_head
309M    9.5_head
401M    9.6_head
397M    9.7_head

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ sudo rm -rf 9.*




The cluster is still healthy:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up

It probably hasn't noticed the damage yet; there's no I/O on this test cluster unless I generate it. Let's retrieve some data, which should make the cluster notice.


me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2 [1 of 1]
ERROR: S3 error: 404 (Not Found):

me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):

I wasn't expecting that. I expected my object to still be accessible. Worst case, it should be accessible 50% of the time. Instead, it's 0% accessible. And the cluster thinks it's still healthy:


me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up

Scrubbing the PGs corrects the cluster's status, but still doesn't let me download:


me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ for i in `seq 0 7`
>  do
>   ceph pg scrub 9.$i
> done
instructing pg 9.0 on osd.0 to scrub
instructing pg 9.1 on osd.0 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.3 on osd.0 to scrub
instructing pg 9.4 on osd.0 to scrub
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub
instructing pg 9.7 on osd.0 to scrub
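
Each line of that output names the OSD the scrub request was sent to. As a small sketch (not part of the original session), the lines can be grouped by OSD to see at a glance which PGs were handled by which daemon. The function name `group_pgs_by_osd` is hypothetical, and the sketch assumes the exact `instructing pg <pgid> on <osd> to scrub` wording shown above.

```shell
# Hypothetical helper: read "instructing pg <pgid> on <osd> to scrub" lines
# on stdin and print one line per OSD listing the PGs sent to it.
group_pgs_by_osd() {
  awk '$1 == "instructing" && $2 == "pg" {
         # $3 is the PG id, $5 is the OSD name (e.g. osd.0)
         pgs[$5] = (pgs[$5] == "" ? $3 : pgs[$5] " " $3)
       }
       END { for (osd in pgs) print osd ": " pgs[osd] }' | sort
}

# Usage: save the scrub loop's output to a file, then
#   group_pgs_by_osd < scrub-output.txt
```

For the eight lines above this gives osd.0 with 9.0, 9.1, 9.3, 9.4 and 9.7, and osd.1 with 9.2, 9.5 and 9.6. The osd.1 group is the same three PGs that later show up as inconsistent.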

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4105: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 5088 MB used, 97258 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up




And I still can't download my data:

me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):



To fix this, I have to scrub the OSD:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph osd scrub 0
osd.0 instructed to scrub

This runs for a while, until it reaches the affected PGs. Then the PGs start recovering:


me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status

   health HEALTH_ERR 3 pgs inconsistent; 2 pgs recovering; 6 pgs recovery_wait; 8 pgs stuck unclean; recovery 988/1534 degraded (64.407%); recovering 2 o/s, 10647KB/s; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4151: 248 pgs: 240 active+clean, 4 active+recovery_wait, 1 active+recovering+inconsistent, 2 active+recovery_wait+inconsistent, 1 active+recovering; 2852 MB data, 5125 MB used
7KB/s
   mdsmap e1: 0/0/1 up



As soon as the cluster starts recovering, I can access my object again:

me@client ~ $ s3cmd ls s3://bucket1

2013-06-13 01:10      381M   15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13   8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt


me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
 400000000 of 400000000   100% in   92s     4.13 MB/s  done
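
A completed transfer does not by itself show that the bytes are intact after the PGs were deleted and recovered. As a hedged sketch (not something run in the original session), the download could be checked against the hex string shown in the `s3cmd ls` listing, assuming that string is the MD5 of the object. The function name `verify_md5` is hypothetical; `md5sum` is the standard coreutils tool.

```shell
# Hypothetical helper: compare a file's MD5 with an expected hex digest.
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
verify_md5() {
  file="$1"
  expected="$2"
  [ -r "$file" ] || { echo "cannot read: $file" >&2; return 2; }
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual=$(md5sum "$file" | awk '{ print $1 }')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got $actual)" >&2
    return 1
  fi
}

# Usage, with the digest taken from the listing above (assumed to be an MD5):
#   verify_md5 ./Object.Download2 15bdad3e014ca5f5c9e5c706e17d65f3
```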

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status

   health HEALTH_ERR 3 pgs inconsistent; 5 pgs recovering; 5 pgs stuck unclean; recovery 228/1534 degraded (14.863%); recovering 2 o/s, 11025KB/s; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4259: 248 pgs: 241 active+clean, 1 active+recovering+inconsistent, 2 active+clean+inconsistent, 4 active+recovering; 2852 MB data, 7428 MB used, 94919 MB / 102347 MB avail; 22
   mdsmap e1: 0/0/1 up




Everything continues to work, but the cluster doesn't completely heal:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4280: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up




At this point, I have to scrub the inconsistent PGs:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph pg dump | grep inconsistent | cut -f1 | while read pg
>  do
>   ceph pg scrub $pg
> done
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub




Everything continues to work until the cluster has fully recovered:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4283: 248 pgs: 248 active+clean; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up
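
Instead of re-running `ceph status` by hand between steps, the wait for recovery could be scripted. This is a minimal sketch under assumptions, not what was done in the session above: it polls `ceph health` and treats a line starting with HEALTH_OK as healthy. The function name `wait_for_health_ok` and its interval and retry defaults are illustrative.

```shell
# Hypothetical helper: poll `ceph health` until it reports HEALTH_OK.
# $1 = seconds between polls (default 10), $2 = maximum polls (default 60).
# Returns 0 once healthy, 1 if the cluster never reported HEALTH_OK.
wait_for_health_ok() {
  interval="${1:-10}"
  max_tries="${2:-60}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if ceph health 2>/dev/null | grep -q '^HEALTH_OK'; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  return 1
}

# Usage (assumes a reachable cluster):
#   wait_for_health_ok 5 120 && echo "cluster is healthy again"
```

Given what happened earlier in this test, HEALTH_OK alone should be read with care: the cluster also reported HEALTH_OK right after the PG directories were removed, before any scrub had run.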





So I'm a bit confused.

Why was the data not accessible between the data loss and the manual OSD scrub?

What is the effective difference between the PG scrub and the OSD scrub?


Thanks for the help.


--

*Craig Lewis*
Senior Systems Engineer
