By the way, a reboot is too drastic; I just do a kill -9 of corosync.

On May 11, 2010, at 7:37 AM, Alain.Moulle wrote:
> Hi Steven,
>
> Vadym, just to know: did you execute crm_mon in another window when the
> corosync shutdown was stalled, just to see if there were some "failed" items?
>
> On my side: I've set debug off and the news (bad or good) is that it did
> not occur again, but that was also the case since yesterday with debug on!
> With debug off, I've tried 10 times without any problem on corosync shutdown.
> So I tried again the thing I thought was a good clue two days ago:
>
> with debug: off (but it is similar with debug on)
> /etc/init.d/corosync stop => successful
> mv external/ipmi external/ipmi.save to force the start of my
> resourcetofence to fail
> /etc/init.d/corosync start => successful
> but crm_mon shows:
> restofencenode2 (stonith:external/ipmi): Started node3 FAILED
> Failed actions:
> restofencenode2_start_0 (node=node3, call=5, rc=1, status=complete):
> unknown error
> then:
> /etc/init.d/corosync stop
> Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
> Waiting for corosync services to
> unload:............................................
> .............................................................................................................
> and it does not return (for about 5 minutes now)
> So I did:
> crm resource cleanup restofencenode2
> crm resource stop restofencenode2
> but unfortunately, it does not help the corosync shutdown to complete...
> So I have to reboot the node ...
>
> Don't know if this helps but ... ok I'll try other things ...
> Alain
>
>> The bad news - it didn't help, still observing the same issue.
>> The good news - it's 100% reproducible.
>>
>> Vadym
>>
>> On May 10, 2010, at 7:19 PM, Steven Dake wrote:
>>
>>>> On Mon, 2010-05-10 at 19:02 -0400, Vadym Chepkov wrote:
>>>>>> Yes, I am
>>>>
>>>> try without
>>>>
>>>>>> On May 10, 2010, at 6:59 PM, Steven Dake wrote:
>>>>>>>> Do you have debug: on in your config file?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> -steve
>>>>>>>>
>>>>>>>> On Mon, 2010-05-10 at 18:24 -0400, Vadym Chepkov wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I experienced the same issue on Redhat 5.5 PPC.
>>>>>>>>>> I compiled all packages myself, since there are no ppc packages
>>>>>>>>>> available in the clusterlabs repository.
>>>>>>>>>> If Andrew will post his SRPM somewhere or maybe instructions how to
>>>>>>>>>> compile it, I would be happy to contribute.
>>>>>>>>>>
>>>>>>>>>> Vadym
>>>>>>>>>>
>>>>>>>>>> On May 10, 2010, at 5:38 PM, Steven Dake wrote:
>>>>>>>>>>>> It seems pretty clear from the mailing list traffic recently there is a
>>>>>>>>>>>> critical flaw with the shutdown related in some way to Pacemaker and
>>>>>>>>>>>> Corosync that happens on a few people's opensuse systems. It seems to
>>>>>>>>>>>> only reproduce on opensuse however we don't know if it is limited to
>>>>>>>>>>>> this platform. Finally we want Corosync to work perfectly for every
>>>>>>>>>>>> Linux platform and will do everything possible to understand the
>>>>>>>>>>>> specific environmental issues that are exposing bugs in Corosync.
>>>>>>>>>>>> Unfortunately for several weeks we have been unable in our labs to
>>>>>>>>>>>> reproduce this problem which means we need your help!
>>>>>>>>>>>>
>>>>>>>>>>>> The developers will work to resolve this problem at our highest priority
>>>>>>>>>>>> and release a fix as soon as we can generate an adequate execution trace.
>>>>>>>>>>>>
>>>>>>>>>>>> We have a backtrace around where the issue occurred which presents us
>>>>>>>>>>>> with enough data to get started.
>>>>>>>>>>>>
>>>>>>>>>>>> Our plans are as follows:
>>>>>>>>>>>> Mon-Wed: Code review of suspected areas and instrumentation patch created
>>>>>>>>>>>> Thu: Special build created by Andrew with the instrumentation patch for
>>>>>>>>>>>> those people affected by this issue.
>>>>>>>>>>>> We will begin analysis of the instrumentation results once we have a trace.
>>>>>>>>>>>>
>>>>>>>>>>>> I would really appreciate those people affected by this issue to run
>>>>>>>>>>>> Andrew's special build of Corosync which will have more trace info in it
>>>>>>>>>>>> when it is available.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> -steve
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 2010-05-10 at 14:26 +0200, Alain.Moulle wrote:
>>>>>>>>>>>>>> As soon as I get it again ... because, strangely, I have not faced
>>>>>>>>>>>>>> the problem again since this morning! And besides, I'm sure that on
>>>>>>>>>>>>>> Friday I was in a case where the stop/cleanup (of a resource that
>>>>>>>>>>>>>> failed on start) enabled the corosync shutdown to complete, and as
>>>>>>>>>>>>>> long as I had not cleaned up the failed resource, the corosync stop
>>>>>>>>>>>>>> did not return and was stalled in "Waiting for corosync services to
>>>>>>>>>>>>>> unload:........
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll keep you informed if I can find the conditions for this
>>>>>>>>>>>>>> abnormal behavior.
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> Alain
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrew Beekhof wrote:
>>>>>>>>>>>>>>>> On Mon, May 10, 2010 at 8:31 AM, Alain.Moulle
>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> I meant "/etc/init.d/corosync stop" never returns.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ok.
>>>>>>>>>>>>>>>> Can you show us the logs and "ps axf" please?
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
