Hi,

On Tue, May 11, 2010 at 07:40:39AM -0400, Vadym Chepkov wrote:
> By the way, reboot is too drastic, I do kill -9 of the corosync

I guess that corosync is waiting for crmd to stop. Did you try to
kill crmd?
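
In case it is useful, here is a rough sketch of what I mean. It is not a tested procedure: it assumes pgrep and pkill (procps) are installed, and the helper name stop_daemon is made up for illustration. It sends TERM first and falls back to KILL only if the process is still there a few seconds later.

```shell
# Hypothetical helper, for illustration only: stop a daemon by exact name.
# Assumes pgrep/pkill from procps are available on the node.
stop_daemon() {
    name="$1"
    # Nothing to do if no process with exactly this name exists.
    if ! pgrep -x "$name" >/dev/null 2>&1; then
        echo "$name is not running"
        return 0
    fi
    # Ask politely first.
    pkill -TERM -x "$name"
    sleep 5
    # Still alive after the grace period: force it.
    if pgrep -x "$name" >/dev/null 2>&1; then
        pkill -KILL -x "$name"
    fi
    echo "$name signalled"
}
```

On the stalled node you would then run "stop_daemon crmd" as root and watch whether the corosync shutdown completes.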

Thanks,

Dejan

> On May 11, 2010, at 7:37 AM, Alain.Moulle wrote:
> 
> > Hi Steven ,
> > Vadym, just to know: did you run crm_mon in another window while the
> > corosync shutdown was stalled, to see whether there were any "failed" items?
> > On my side: I've set debug off, and the news (bad or good) is that it
> > did not occur again, but that was also the case since yesterday with debug on!
> > With debug off, I've tried 10 times without any problem on corosync shutdown.
> > So I tried again the thing I thought was a good clue two days ago,
> > with debug: off (but it is similar with debug on):
> > /etc/init.d/corosync stop    => successful
> > mv external/ipmi external/ipmi.save    (to force the start of my resourcetofence to fail)
> > /etc/init.d/corosync start    => successful
> > but crm_mon shows :
> >  restofencenode2        (stonith:external/ipmi):    Started node3 FAILED
> >  Failed actions:
> >    restofencenode2_start_0 (node=node3, call=5, rc=1, status=complete): unknown error
> > then :
> > /etc/init.d/corosync stop
> > Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
> > Waiting for corosync services to unload:............................................
> > .............................................................................................................
> > and it does not return (for about 5 minutes now)
> > So I did :
> > crm resource cleanup restofencenode2
> > crm resource stop restofencenode2
> > but unfortunately, it does not help the corosync shutdown to complete...
> > So I have to reboot the node ...
> > 
> > Don't know if this helps but ... ok I'll try other things ...
> > Alain
> > 
> > 
> >> The bad news - it didn't help, still observing the same issue.
> > 
> >> The good news - it's 100% reproducible.
> >> 
> >> Vadym
> >> 
> >> On May 10, 2010, at 7:19 PM, Steven Dake wrote:
> >> 
> >> 
> >>>> On Mon, 2010-05-10 at 19:02 -0400, Vadym Chepkov wrote:
> >>> 
> >>>>>> Yes, I am
> >>>>>> 
> >>>> 
> >>>> try without
> >>>> 
> >>> 
> >>>>>> 
> >>>>>> On May 10, 2010, at 6:59 PM, Steven Dake wrote:
> >>>>>> 
> >>>> 
> >>>>>>>> Do you have debug: on in your config file?
> >>>>>>>> 
> >>>>>>>> Regards
> >>>>>>>> -steve
> >>>>>>>> 
> >>>>>>>> On Mon, 2010-05-10 at 18:24 -0400, Vadym Chepkov wrote:
> >>>>> 
> >>>>>>>>>> Hi,
> >>>>>>>>>> 
> >>>>>>>>>> I experienced the same issue on Redhat 5.5 PPC.
> >>>>>>>>>> I compiled all packages myself, since there are no ppc packages
> >>>>>>>>>> available in the clusterlabs repository.
> >>>>>>>>>> If Andrew posts his SRPM somewhere, or maybe instructions on how
> >>>>>>>>>> to compile it, I would be happy to contribute.
> >>>>>>>>>> 
> >>>>>>>>>> Vadym
> >>>>>>>>>> 
> >>>>>>>>>> On May 10, 2010, at 5:38 PM, Steven Dake wrote:
> >>>>>>>>>> 
> >>>>>> 
> >>>>>>>>>>>> It seems pretty clear from recent mailing list traffic that there is a
> >>>>>>>>>>>> critical flaw in shutdown, related in some way to Pacemaker and
> >>>>>>>>>>>> Corosync, that happens on a few people's opensuse systems.  It seems to
> >>>>>>>>>>>> reproduce only on opensuse; however, we don't know if it is limited to
> >>>>>>>>>>>> this platform.  Finally, we want Corosync to work perfectly on every
> >>>>>>>>>>>> Linux platform and will do everything possible to understand the
> >>>>>>>>>>>> specific environmental issues that are exposing bugs in Corosync.
> >>>>>>>>>>>> Unfortunately, for several weeks we have been unable to reproduce this
> >>>>>>>>>>>> problem in our labs, which means we need your help!
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The developers will work to resolve this problem at our highest priority
> >>>>>>>>>>>> and release a fix as soon as we can generate an adequate execution
> >>>>>>>>>>>> trace.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> We have a backtrace around where the issue occurred which presents us
> >>>>>>>>>>>> with enough data to get started.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Our plans are as follows:
> >>>>>>>>>>>> Mon-Wed: Code review of suspected areas and instrumentation patch created
> >>>>>>>>>>>> Thu: Special build created by Andrew with the instrumentation patch for
> >>>>>>>>>>>> those people affected by this issue.
> >>>>>>>>>>>> We will begin analysis of the instrumentation results once we have a
> >>>>>>>>>>>> trace.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> I would really appreciate it if those affected by this issue would run
> >>>>>>>>>>>> Andrew's special build of Corosync, which will have more trace info in it,
> >>>>>>>>>>>> when it is available.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Regards
> >>>>>>>>>>>> -steve 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> On Mon, 2010-05-10 at 14:26 +0200, Alain.Moulle wrote:
> >>>>>>> 
> >>>>>>>>>>>>>> As soon as I get it again ... because it is strange, I have not faced
> >>>>>>>>>>>>>> the problem again since this morning! Besides, I'm sure that on Friday
> >>>>>>>>>>>>>> I was in a case where the stop/cleanup (of a resource that failed on start)
> >>>>>>>>>>>>>> enabled the corosync shutdown to complete, and as long as I had not
> >>>>>>>>>>>>>> cleaned up the failed resource, the corosync stop did not return and
> >>>>>>>>>>>>>> was stalled in "Waiting for corosync services to unload:........"
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I'll keep you informed if I can find the conditions for this abnormal
> >>>>>>>>>>>>>> behavior.
> >>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>> Regards
> >>>>>>>>>>>>>> Alain
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Andrew Beekhof wrote:
> >>>>>>>> 
> >>>>>>>>>>>>>>>> On Mon, May 10, 2010 at 8:31 AM, Alain.Moulle 
> >>>>>>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>>>>>>>>>> I meant  "/etc/init.d/corosync stop" never returns.
> >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Ok. Can you show us the logs and "ps axf" please?
> > _______________________________________________
> > Openais mailing list
> > [email protected]
> > https://lists.linux-foundation.org/mailman/listinfo/openais