There is nothing to kill. crmd has already exited (I can see it in the log); at this point it is just a ghost in the defunct state.
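For anyone who wants to check for the same condition on their own node, here is a rough sketch, not anything taken from the Pacemaker or Corosync tooling. It assumes a Linux procps `ps` that accepts `-eo pid,ppid,stat,comm`; the function names `find_defunct` and `list_defunct` are my own. It lists every process whose state starts with `Z` (defunct) along with its parent PID, since a defunct process cannot be killed and only goes away once its parent reaps it.

```python
import subprocess


def find_defunct(ps_output):
    """Parse `ps -eo pid,ppid,stat,comm` output.

    Returns a list of (pid, ppid, command) tuples for rows whose
    STAT column starts with 'Z', i.e. defunct (zombie) processes.
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # blank or malformed row
        pid, ppid, stat, comm = fields
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def list_defunct():
    """Run ps (assumed: Linux procps) and return the defunct processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_defunct(result.stdout)


if __name__ == "__main__":
    for pid, ppid, comm in list_defunct():
        # kill -9 on the zombie itself does nothing; the parent must reap it.
        print(f"{comm} (pid {pid}) is defunct; parent pid {ppid} has not reaped it")
```

If crmd shows up in that list, sending it signals will not change anything; the parent PID in the output is the process that still has to collect its exit status.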
On Tue, May 11, 2010 at 8:42 AM, Dejan Muhamedagic <[email protected]> wrote:

> Hi,
>
> On Tue, May 11, 2010 at 07:40:39AM -0400, Vadym Chepkov wrote:
> > By the way, reboot is too drastic, I do kill -9 of the corosync
>
> I guess that corosync is waiting for crmd to stop. Did you try to
> kill crmd?
>
> Thanks,
>
> Dejan
>
> > On May 11, 2010, at 7:37 AM, Alain.Moulle wrote:
> >
> > > Hi Steven,
> > > Vadym, just to know: did you execute crm_mon on another window when the
> > > corosync shutdown was stalled, just to see if there was some "failed" items?
> > > On my side: I've set debug off and the news (bad or good) is that it
> > > did not occur again, but it was also the case since yesterday with debug on!
> > > With debug off, I've tried 10 times without any problem on corosync shutdown.
> > > So I tried again the thing I thought it was a good clue two days ago:
> > > with debug: off (but it is similar with debug on)
> > > /etc/init.d/corosync stop => successful
> > > mv external/ipmi external/ipmi.save to force the start of my
> > > resourcetofence to be failed
> > > /etc/init.d/corosync start => successful
> > > but crm_mon shows:
> > > restofencenode2 (stonith:external/ipmi): Started node3 FAILED
> > > Failed actions:
> > > restofencenode2_start_0 (node=node3, call=5, rc=1, status=complete): unknown error
> > > then:
> > > /etc/init.d/corosync stop
> > > Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
> > > Waiting for corosync services to unload:............................................
> > > .............................................................................................................
> > > and it does not return (since about 5mn)
> > > So I did:
> > > crm resource cleanup restofencenode2
> > > crm resource stop restofencenode2
> > > but unfortunately, it does not help the corosync shutdown to complete...
> > > So I have to reboot the node ...
> > >
> > > Don't know if this helps but ... ok I'll try other things ...
> > > Alain
> > >
> > >> The bad news - it didn't help, still observing the same issue.
> > >> The good news - it's 100% reproducible.
> > >>
> > >> Vadym
> > >>
> > >> On May 10, 2010, at 7:19 PM, Steven Dake wrote:
> > >>
> > >>>> On Mon, 2010-05-10 at 19:02 -0400, Vadym Chepkov wrote:
> > >>>>>> Yes, I am
> > >>>>
> > >>>> try without
> > >>>>
> > >>>>>> On May 10, 2010, at 6:59 PM, Steven Dake wrote:
> > >>>>>>>> Do you have debug: on in your config file?
> > >>>>>>>>
> > >>>>>>>> Regards
> > >>>>>>>> -steve
> > >>>>>>>>
> > >>>>>>>> On Mon, 2010-05-10 at 18:24 -0400, Vadym Chepkov wrote:
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I experienced the same issue on Redhat 5.5 PPC.
> > >>>>>>>>>> I compiled all packages myself, since there are no ppc
> > >>>>>>>>>> packages available in the clusterlabs repository.
> > >>>>>>>>>> If Andrew will post his SRPM somewhere or maybe instructions
> > >>>>>>>>>> how to compile it, I would be happy to contribute.
> > >>>>>>>>>>
> > >>>>>>>>>> Vadym
> > >>>>>>>>>>
> > >>>>>>>>>> On May 10, 2010, at 5:38 PM, Steven Dake wrote:
> > >>>>>>>>>>>> It seems pretty clear from the mailing list traffic recently there is a
> > >>>>>>>>>>>> critical flaw with the shutdown related in some way to Pacemaker and
> > >>>>>>>>>>>> Corosync that happens on a few people's opensuse systems. It seems to
> > >>>>>>>>>>>> only reproduce on opensuse however we don't know if it is limited to
> > >>>>>>>>>>>> this platform. Finally we want Corosync to work perfectly for every
> > >>>>>>>>>>>> Linux platform and will do everything possible to understand the
> > >>>>>>>>>>>> specific environmental issues that are exposing bugs in Corosync.
> > >>>>>>>>>>>> Unfortunately for several weeks we have been unable in our labs to
> > >>>>>>>>>>>> reproduce this problem which means we need your help!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The developers will work to resolve this problem at our highest priority
> > >>>>>>>>>>>> and release a fix as soon as we can generate an adequate execution
> > >>>>>>>>>>>> trace.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> We have a backtrace around where the issue occurred which presents us
> > >>>>>>>>>>>> with enough data to get started.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Our plans are as follows:
> > >>>>>>>>>>>> Mon-Wed: Code review of suspected areas and instrumentation patch
> > >>>>>>>>>>>> created
> > >>>>>>>>>>>> Thu: Special build created by Andrew with the instrumentation patch for
> > >>>>>>>>>>>> those people affected by this issue.
> > >>>>>>>>>>>> We will begin analysis of the instrumentation results once we have a
> > >>>>>>>>>>>> trace.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I would really appreciate those people affected by this issue to run
> > >>>>>>>>>>>> Andrew's special build of Corosync which will have more trace info in it
> > >>>>>>>>>>>> when it is available.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Regards
> > >>>>>>>>>>>> -steve
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, 2010-05-10 at 14:26 +0200, Alain.Moulle wrote:
> > >>>>>>>>>>>>>> As soon as I got it again ... because it is strange, I did not face
> > >>>>>>>>>>>>>> the problem again since this morning! And besides I'm sure that on
> > >>>>>>>>>>>>>> Friday I was in a case where the stop/cleanup (of a resource failed
> > >>>>>>>>>>>>>> on start) enables the corosync shutdown to complete, and as long as
> > >>>>>>>>>>>>>> I had not cleanup the failed resource, the corosync stop does not
> > >>>>>>>>>>>>>> returns and was stalled in "Waiting for corosync services to
> > >>>>>>>>>>>>>> unload:........
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I'll keep you inform if I can find the conditions for this abnormal
> > >>>>>>>>>>>>>> behavior.
> > >>>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>>> Regards
> > >>>>>>>>>>>>>> Alain
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Andrew Beekhof wrote:
> > >>>>>>>>>>>>>>>> On Mon, May 10, 2010 at 8:31 AM, Alain.Moulle <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>>> I meant "/etc/init.d/corosync stop" never returns.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Ok. Can you show us the logs and "ps axf" please?
> > > _______________________________________________
> > > Openais mailing list
> > > [email protected]
> > > https://lists.linux-foundation.org/mailman/listinfo/openais
_______________________________________________ Openais mailing list [email protected] https://lists.linux-foundation.org/mailman/listinfo/openais
