Hi Cephers,
I have a (test) Ceph cluster on which I had mistakenly set a wrong CRUSH weight. 
When I then set the correct CRUSH weight (e.g. changed the weight from 20 to 5), 
the cluster went into a cascading failure mode right afterwards: lots of OSDs 
started getting marked down, and as I started some of them back up, others went 
down. I tried the following (rough commands are included after the list):
  1> Stopped all client traffic. I had changed the weight while client traffic 
was running, which is what triggered the failures.
  2> Increased osd_op_thread_timeout to make sure the OSDs do not crash 
themselves under heavy load (e.g. load from peering).
  3> Set the osd nodown flag at times to avoid too many map changes back and forth.
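
For reference, these are roughly the commands involved (the osd id, weight and 
timeout value below are just examples, not the exact ones I used):

  # set the corrected CRUSH weight
  ceph osd crush reweight osd.297 5.0

  # raise the op thread timeout on all OSDs at runtime
  ceph tell osd.* injectargs '--osd-op-thread-timeout 300'

  # stop the monitors from marking OSDs down while things settle
  ceph osd set nodown
  # ... and later
  ceph osd unset nodown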

With all those changes, I don't see OSD crashes anymore; however, I am still not 
able to bring the down OSDs back up, even though the daemons themselves are 
alive, and they seem stuck forever. From my observation, those (down) OSDs are 
still processing the peering event: they have been in that state for at least 12 
hours, constantly logging the following messages over and over again.

2015-07-29 15:49:16.953804 7f46b7281700 10 osd.297 pg_epoch: 56499 pg[6.cd7s0( 
v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 les/c 
56237/2872 56496/56498/56498) [297,512,1,118,127,161,99,132,23,451,2147483647] 
r=0 lpr=56498 pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] 
search_for_missing 
2f402cd7/default.12598.168_osd042c014.cos.bf2.yahoo.com_7a46ad192c5fdb7ff235ef9a2b68760f/head//6
 2930'4649 also missing on osd.297(0)
2015-07-29 15:49:16.963061 7f46b7281700 10 osd.297 pg_epoch: 56499 pg[6.cd7s0( 
v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 les/c 
56237/2872 56496/56498/56498) [297,512,1,118,127,161,99,132,23,451,2147483647] 
r=0 lpr=56498 pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] 
search_for_missing 
80502cd7/default.12598.122_osd033c014.cos.bf2.yahoo.com_b9d89ace80b6ff88485269fd20676697/head//6
 2829'1511 also missing on osd.297(0)
2015-07-29 15:49:16.972281 7f46b7281700 10 osd.297 pg_epoch: 56499 pg[6.cd7s0( 
v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 les/c 
56237/2872 56496/56498/56498) [297,512,1,118,127,161,99,132,23,451,2147483647] 
r=0 lpr=56498 pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] 
search_for_missing 
e1502cd7/default.12598.135_osd004c014.cos.bf2.yahoo.com_ffc8e0df7eee2ff7fb0ddf6fe90d18b0/head//6
 2783'1 also missing on osd.297(0)
2015-07-29 15:49:16.981487 7f46b7281700 10 osd.297 pg_epoch: 56499 pg[6.cd7s0( 
v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387 les/c 
56237/2872 56496/56498/56498) [297,512,1,118,127,161,99,132,23,451,2147483647] 
r=0 lpr=56498 pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919] 
search_for_missing 
b4502cd7/default.14462.240_osd030c014.cos.bf2.yahoo.com_cd3acb61271cffa73b9bb6d1622ff294/head//6
 2845'3668 also missing on osd.297(0)
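
To double-check that the daemons are alive but marked down, and to see where 
peering is stuck, I have been looking at roughly the following (the osd and pg 
ids are the ones from the log above; I believe pg query takes the pgid without 
the shard suffix):

  ceph osd tree | grep osd.297      # reports the OSD as down even though the process is running
  ceph daemon osd.297 status        # run on the OSD's host; the admin socket still responds
  ceph pg 6.cd7 query               # peering/recovery state of the stuck PG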

I have several questions:
  1> What could have caused those OSDs to be marked down while the daemons are 
alive? The only thing I can think of is that the OSDs did not respond to 
heartbeat pings, but I failed to find any logs for that.
  2> I thought setting the nodown flag should help in such a case, but even 
after a long time, when I unset the flag, those OSDs are kicked down again 
immediately. Is that expected?
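
For question 1, this is roughly how I have been hunting for heartbeat failure 
reports; the grep pattern is just my guess at the relevant messages, and <name> 
stands for the monitor's name:

  # on a monitor host: look for failure reports against a given OSD in the cluster log
  grep 'osd.297' /var/log/ceph/ceph.log | grep -i fail

  # how long the monitors wait for heartbeats before marking an OSD down
  ceph daemon mon.<name> config get osd_heartbeat_grace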

Thanks,
Guang