Re: [Openais] plan for resolving corosync services unloading problem blocking shutdown on opensuse

Vadym Chepkov Tue, 11 May 2010 04:41:52 -0700

By the way, a reboot is too drastic; I just do a kill -9 of the corosync process.
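
For anyone else hitting the stalled shutdown, that workaround could be sketched roughly as the shell function below. This is only an illustration under stated assumptions: `stop_or_kill` is a hypothetical helper name, not something shipped with corosync, and the 60-second default is an arbitrary choice. It asks the process to terminate, polls for a while, and only then falls back to SIGKILL. Note that SIGKILL gives the process no chance to clean up.

```shell
#!/bin/sh
# Hypothetical helper (not part of corosync): send SIGTERM, wait up to
# TIMEOUT seconds, then fall back to SIGKILL (the "kill -9" mentioned above).
# Returns 0 if the process exited on its own, 1 if SIGKILL was needed.
stop_or_kill() {
    pid=$1
    timeout=${2:-60}    # assumed default, pick what suits your cluster

    # Nothing to do if the process is already gone.
    kill -0 "$pid" 2>/dev/null || return 0

    kill -TERM "$pid" 2>/dev/null
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "pid $pid still alive after ${timeout}s, sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

It might be called as `stop_or_kill "$(pidof corosync)" 60`, assuming `pidof` is available and reports a single corosync PID on your system.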

On May 11, 2010, at 7:37 AM, Alain.Moulle wrote:


> Hi Steven,
> Vadym, just to know: did you run crm_mon in another window while the
> corosync shutdown was stalled, to see whether there were any "failed" items?
> On my side: I set debug off, and the news (bad or good) is that the problem
> did not occur again, but that had also been the case since yesterday with
> debug on! With debug off, I tried 10 times without any problem on corosync
> shutdown.  So I tried again the thing I thought was a good clue two days ago,
> with debug: off (but it is similar with debug on):
> /etc/init.d/corosync stop    => successful
> mv external/ipmi external/ipmi.save    (to force the start of my
> resourcetofence to fail)
> /etc/init.d/corosync start    => successful
> but crm_mon shows :
>  restofencenode2        (stonith:external/ipmi):    Started node3 FAILED
>  Failed actions:
>    restofencenode2_start_0 (node=node3, call=5, rc=1, status=complete): unknown error
> then :
> /etc/init.d/corosync stop
> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
> Waiting for corosync services to unload:............................................
> .............................................................................................................
> and it does not return (it has been about 5 minutes)
> So I did :
> crm resource cleanup restofencenode2
> crm resource stop restofencenode2
> but unfortunately, it does not help the corosync shutdown to complete...
> So I have to reboot the node ...
> 
> Don't know if this helps but ... ok I'll try other things ...
> Alain
> 
> 
>> The bad news - it didn't help, still observing the same issue.
> 
>> The good news - it's 100% reproducible.
>> 
>> Vadym
>> 
>> On May 10, 2010, at 7:19 PM, Steven Dake wrote:
>> 
>> 
>>>> On Mon, 2010-05-10 at 19:02 -0400, Vadym Chepkov wrote:
>>> 
>>>>>> Yes, I am
>>>>>> 
>>>> 
>>>> try without
>>>> 
>>> 
>>>>>> 
>>>>>> On May 10, 2010, at 6:59 PM, Steven Dake wrote:
>>>>>> 
>>>> 
>>>>>>>> Do you have debug: on in your config file?
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> -steve
>>>>>>>> 
>>>>>>>> On Mon, 2010-05-10 at 18:24 -0400, Vadym Chepkov wrote:
>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I experienced the same issue on Redhat 5.5 PPC.
>>>>>>>>>> I compiled all packages myself, since there are no ppc packages
>>>>>>>>>> available in the clusterlabs repository.
>>>>>>>>>> If Andrew posts his SRPM somewhere, or maybe instructions on how to
>>>>>>>>>> compile it, I would be happy to contribute.
>>>>>>>>>> 
>>>>>>>>>> Vadym
>>>>>>>>>> 
>>>>>>>>>> On May 10, 2010, at 5:38 PM, Steven Dake wrote:
>>>>>>>>>> 
>>>>>> 
>>>>>>>>>>>> It seems pretty clear from the recent mailing list traffic that there
>>>>>>>>>>>> is a critical flaw with the shutdown, related in some way to Pacemaker
>>>>>>>>>>>> and Corosync, that happens on a few people's opensuse systems.  It
>>>>>>>>>>>> seems to only reproduce on opensuse; however, we don't know if it is
>>>>>>>>>>>> limited to this platform.  Finally, we want Corosync to work perfectly
>>>>>>>>>>>> for every Linux platform and will do everything possible to understand
>>>>>>>>>>>> the specific environmental issues that are exposing bugs in Corosync.
>>>>>>>>>>>> Unfortunately, for several weeks we have been unable to reproduce this
>>>>>>>>>>>> problem in our labs, which means we need your help!
>>>>>>>>>>>> 
>>>>>>>>>>>> The developers will work to resolve this problem at our highest
>>>>>>>>>>>> priority and release a fix as soon as we can generate an adequate
>>>>>>>>>>>> execution trace.
>>>>>>>>>>>> 
>>>>>>>>>>>> We have a backtrace around where the issue occurred which presents us
>>>>>>>>>>>> with enough data to get started.
>>>>>>>>>>>> 
>>>>>>>>>>>> Our plans are as follows:
>>>>>>>>>>>> Mon-Wed: Code review of suspected areas and instrumentation patch
>>>>>>>>>>>> created
>>>>>>>>>>>> Thu: Special build created by Andrew with the instrumentation patch
>>>>>>>>>>>> for those people affected by this issue.
>>>>>>>>>>>> We will begin analysis of the instrumentation results once we have a
>>>>>>>>>>>> trace.
>>>>>>>>>>>> 
>>>>>>>>>>>> I would really appreciate it if those people affected by this issue
>>>>>>>>>>>> would run Andrew's special build of Corosync, which will have more
>>>>>>>>>>>> trace info in it, when it is available.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards
>>>>>>>>>>>> -steve 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, 2010-05-10 at 14:26 +0200, Alain.Moulle wrote:
>>>>>>> 
>>>>>>>>>>>>>> As soon as I get it again ... because it is strange, I have not
>>>>>>>>>>>>>> faced the problem again since this morning! And besides, I'm sure
>>>>>>>>>>>>>> that on Friday I was in a case where the stop/cleanup (of a resource
>>>>>>>>>>>>>> that failed on start) enabled the corosync shutdown to complete, and
>>>>>>>>>>>>>> as long as I had not cleaned up the failed resource, the corosync
>>>>>>>>>>>>>> stop did not return and was stalled in "Waiting for corosync
>>>>>>>>>>>>>> services to unload:........"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'll keep you informed if I can find the conditions for this
>>>>>>>>>>>>>> abnormal behavior.
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> Alain
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Andrew Beekhof wrote:
>>>>>>>> 
>>>>>>>>>>>>>>>> On Mon, May 10, 2010 at 8:31 AM, Alain.Moulle 
>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I meant  "/etc/init.d/corosync stop" never returns.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Ok. Can you show us the logs and "ps axf" please?
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais

