RE: Cascading failure

GuangYang Wed, 29 Jul 2015 22:35:09 -0700

Looks like we were hitting 12523, and we are working on a fix.

Thanks,
Guang



----------------------------------------
> From: [email protected]
> To: [email protected]
> Subject: Cascading failure
> Date: Wed, 29 Jul 2015 08:55:40 -0700
>
> Hi Cephers,
> I have a (test) Ceph cluster on which I had mistakenly set a wrong CRUSH 
> weight. When I then set the correct CRUSH weight (e.g. changing the weight 
> from 20 to 5), the cluster immediately went into a cascading failure mode: 
> lots of OSDs started going down, and as I restarted some of them, others 
> went down. I tried the following:
> 1> Stop all client traffic. I had changed the weight while client traffic 
> was running, which triggered the failures.
> 2> Increase osd_op_thread_timeout to make sure the OSD does not crash itself 
> due to heavy load (e.g. load from peering).
> 3> Set the nodown flag at times to avoid too many map changes back and forth.
>
> With all those changes, I don't see OSD crashes anymore. However, I am still 
> not able to bring the down OSDs up, although the daemons themselves are 
> alive, and they seem to be stuck forever. Those (down) OSDs are still 
> processing peering events; from my observation, they have been in that state 
> for at least 12 hours, constantly logging the following messages over and 
> over again.
>
> 2015-07-29 15:49:16.953804 7f46b7281700 10 osd.297 pg_epoch: 56499 
> pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 
> les/c 56237/2872 56496/56498/56498) 
> [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498 
> pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] 
> search_for_missing 
> 2f402cd7/default.12598.168_osd042c014.cos.bf2.yahoo.com_7a46ad192c5fdb7ff235ef9a2b68760f/head//6
>  2930'4649 also missing on osd.297(0)
> 2015-07-29 15:49:16.963061 7f46b7281700 10 osd.297 pg_epoch: 56499 
> pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 
> les/c 56237/2872 56496/56498/56498) 
> [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498 
> pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] 
> search_for_missing 
> 80502cd7/default.12598.122_osd033c014.cos.bf2.yahoo.com_b9d89ace80b6ff88485269fd20676697/head//6
>  2829'1511 also missing on osd.297(0)
> 2015-07-29 15:49:16.972281 7f46b7281700 10 osd.297 pg_epoch: 56499 
> pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 
> les/c 56237/2872 56496/56498/56498) 
> [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498 
> pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] 
> search_for_missing 
> e1502cd7/default.12598.135_osd004c014.cos.bf2.yahoo.com_ffc8e0df7eee2ff7fb0ddf6fe90d18b0/head//6
>  2783'1 also missing on osd.297(0)
> 2015-07-29 15:49:16.981487 7f46b7281700 10 osd.297 pg_epoch: 56499 
> pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 
> les/c 56237/2872 56496/56498/56498) 
> [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498 
> pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] 
> search_for_missing 
> b4502cd7/default.14462.240_osd030c014.cos.bf2.yahoo.com_cd3acb61271cffa73b9bb6d1622ff294/head//6
>  2845'3668 also missing on osd.297(0)
>
> I have several questions:
> 1> What could trigger an OSD being marked down (when the daemon is alive)? 
> The only thing I can think of is that the OSD does not respond to heartbeat 
> pings, but I failed to find any logs for that.
> 2> I thought setting the nodown flag should help in such a case, but even 
> after a long time, when I unset the flag, those OSDs are kicked down 
> immediately. Is that expected?
>
> Thanks,
> Guang
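
One detail in the quoted log entries may be worth checking when triaging similar output: the last slot of the bracketed OSD list is 2147483647, which equals 0x7fffffff, the CRUSH_ITEM_NONE placeholder used when no OSD is mapped to a slot. The following is only a rough sketch of how such entries could be scanned for unmapped slots. The function name, the regular expressions, and the assumption that the first bracketed integer list after the pg header is the OSD set are illustrative and not taken from the Ceph sources.

```python
import re

# 0x7fffffff == 2147483647, the CRUSH_ITEM_NONE placeholder for "no OSD in this slot".
CRUSH_ITEM_NONE = 0x7FFFFFFF

# Assumed patterns: the pg id follows "pg[" and ends at "(", and the OSD set
# is the first bracketed, comma-separated list made up of integers only.
PG_ID = re.compile(r"pg\[(\S+?)\(")
OSD_SET = re.compile(r"\[(\d+(?:,\d+)+)\]")

def unmapped_slots(entry):
    """Return (pg_id, [slot indexes with no OSD mapped]) for one log entry,
    or None if the entry does not look like a pg log line."""
    entry = " ".join(entry.split())  # undo mail-client line wrapping
    pg = PG_ID.search(entry)
    osds = OSD_SET.search(entry)
    if not pg or not osds:
        return None
    ids = [int(x) for x in osds.group(1).split(",")]
    return pg.group(1), [i for i, osd in enumerate(ids) if osd == CRUSH_ITEM_NONE]

# First entry from the quoted log, truncated after the lpr field.
SAMPLE = (
    "2015-07-29 15:49:16.953804 7f46b7281700 10 osd.297 pg_epoch: 56499 "
    "pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 "
    "les/c 56237/2872 56496/56498/56498) "
    "[297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498"
)

print(unmapped_slots(SAMPLE))
```

For the sample entry this reports slot index 10 of pg 6.cd7s0 as having no OSD mapped. Whether that is related to the stuck peering described above is not established by this thread.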
