Looks like we were hitting 12523, and we are working on a fix.

Thanks,
Guang
----------------------------------------
> From: [email protected]
> To: [email protected]
> Subject: Cascading failure
> Date: Wed, 29 Jul 2015 08:55:40 -0700
>
> Hi Cephers,
> I have a (test) Ceph cluster on which I had set some wrong CRUSH weights (my
> mistake). When I then set the correct CRUSH weight (e.g. changed the weight
> from 20 to 5), the cluster immediately went into a cascading failure mode:
> lots of OSDs started going down, and as I started some of them, others went
> down. I tried the following:
> 1> Stop all the client traffic. I had changed the weight with client traffic
> running, which triggered the failures.
> 2> Increase osd_op_thread_timeout to make sure the OSD does not crash itself
> due to heavy load (e.g. load from peering).
> 3> Set the osd nodown flag at times to avoid too many map changes back and
> forth.
>
> With all those changes, I don't see OSD crashes anymore. However, I am still
> not able to bring the down OSDs up; although the daemons themselves are
> alive, they seem stuck forever. Those (down) OSDs are still processing
> peering events. From my observation, they have been in that state for at
> least 12 hours, constantly logging the following messages over and over
> again.
>
> 2015-07-29 15:49:16.953804 7f46b7281700 10 osd.297 pg_epoch: 56499
> pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387
> les/c 56237/2872 56496/56498/56498)
> [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498
> pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919]
> search_for_missing
> 2f402cd7/default.12598.168_osd042c014.cos.bf2.yahoo.com_7a46ad192c5fdb7ff235ef9a2b68760f/head//6
> 2930'4649 also missing on osd.297(0)
> 2015-07-29 15:49:16.963061 7f46b7281700 10 osd.297 pg_epoch: 56499
> pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387
> les/c 56237/2872 56496/56498/56498)
> [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498
> pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919]
> search_for_missing
> 80502cd7/default.12598.122_osd033c014.cos.bf2.yahoo.com_b9d89ace80b6ff88485269fd20676697/head//6
> 2829'1511 also missing on osd.297(0)
> 2015-07-29 15:49:16.972281 7f46b7281700 10 osd.297 pg_epoch: 56499
> pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387
> les/c 56237/2872 56496/56498/56498)
> [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498
> pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919]
> search_for_missing
> e1502cd7/default.12598.135_osd004c014.cos.bf2.yahoo.com_ffc8e0df7eee2ff7fb0ddf6fe90d18b0/head//6
> 2783'1 also missing on osd.297(0)
> 2015-07-29 15:49:16.981487 7f46b7281700 10 osd.297 pg_epoch: 56499
> pg[6.cd7s0( v 3699'7919 lc 0'0 (0'0,3699'7919] local-les=56499 n=7919 ec=387
> les/c 56237/2872 56496/56498/56498)
> [297,512,1,118,127,161,99,132,23,451,2147483647] r=0 lpr=56498
> pi=2639-56497/316 crt=3699'7919 mlcod 0'0 inactive m=7919 u=7919]
> search_for_missing
> b4502cd7/default.14462.240_osd030c014.cos.bf2.yahoo.com_cd3acb61271cffa73b9bb6d1622ff294/head//6
> 2845'3668 also missing on osd.297(0)
>
> I have several questions:
> 1> What could cause an OSD to be marked down (when the daemon is alive)? The
> only thing I can think of is that the OSD does not respond to heartbeat
> pings, but I failed to find any logs for that.
> 2> I thought setting the nodown flag should help in such a case, but even
> after a long time, when I reset the flag, those OSDs are kicked down
> immediately. Is that expected?
>
> Thanks,
> Guang
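
Since the search_for_missing messages quoted above repeat for hours, it can help to check whether an OSD keeps reporting the same objects or is working through new ones. Below is a minimal Python sketch for that, not part of Ceph. The regular expression is an assumption inferred only from the four sample lines in this thread (it is not a documented log format), it expects unwrapped lines as they would appear in the OSD log file, and the names LINE_RE, parse_line and missing_by_pg are hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample lines in this thread (an assumption, not a
# documented Ceph log format): timestamp, PG id, object, version, peer.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r".*? pg\[(?P<pg>\d+\.[0-9a-f]+(?:s\d+)?)\("
    r".*? search_for_missing (?P<obj>\S+) (?P<ver>\d+'\d+) "
    r"also missing on (?P<peer>osd\.\d+\(\d+\))$"
)


def parse_line(line):
    """Return a dict of fields for a search_for_missing line, else None."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def missing_by_pg(lines):
    """Map each PG id to the set of distinct objects reported missing."""
    result = defaultdict(set)
    for line in lines:
        fields = parse_line(line)
        if fields is not None:
            result[fields["pg"]].add(fields["obj"])
    return dict(result)
```

Passing an open log file to missing_by_pg() and comparing the per-PG set sizes between two runs gives a rough indication: if the sets stop growing while the messages keep coming, the OSD is probably rescanning the same objects, although that is only a hint and not a diagnosis.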
