Re: [Openais] plan for resolving corosync services unloading problem blocking shutdown on opensuse

Alain.Moulle Tue, 11 May 2010 04:33:21 -0700

Hi Steven,
Vadym, just to know: did you run crm_mon in another window while the
corosync shutdown was stalled, to see whether there were any "failed" items?
On my side: I've set debug off, and the news (bad or good) is that it did
not occur again, but that has also been the case since yesterday with debug on!
With debug off, I've tried 10 times without any problem on corosync shutdown.
So I tried again the thing I thought was a good clue two days ago,
with debug: off (but it is similar with debug on):
/etc/init.d/corosync stop    => successful
mv external/ipmi external/ipmi.save, to force the start of my fencing resource to fail
/etc/init.d/corosync start    => successful
but crm_mon shows :
  restofencenode2        (stonith:external/ipmi):    Started node3 FAILED
  Failed actions:
    restofencenode2_start_0 (node=node3, call=5, rc=1, status=complete): unknown error
then :
/etc/init.d/corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
Waiting for corosync services to unload:............................................
.............................................................................................................
and it has not returned (after about 5 minutes).
So I did :
crm resource cleanup restofencenode2
crm resource stop restofencenode2
but unfortunately, that does not help the corosync shutdown to complete.
So I have to reboot the node.
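
For anyone scripting this reproduction, the stalled stop can be detected rather than waited out. The following is only a minimal sketch, assuming GNU coreutils `timeout` is installed. The names `STOP_CMD`, `STOP_LIMIT`, `stop_with_deadline` and `report_stop`, and the 300-second default, are illustrative and not part of corosync or its init script:

```shell
#!/bin/sh
# Sketch: put a deadline on "corosync stop" so a stall is reported
# instead of blocking the terminal. Assumes GNU coreutils `timeout`.
# STOP_CMD and STOP_LIMIT are hypothetical knobs for this example.

STOP_CMD="${STOP_CMD:-/etc/init.d/corosync stop}"
STOP_LIMIT="${STOP_LIMIT:-300}"   # seconds to wait before declaring a stall

# Run the stop command under a deadline.
# Exit status: the command's own status, or 124 if the deadline expired.
# Note: at the deadline `timeout` only signals the init script itself;
# it does not unblock a corosync process that is stuck unloading.
stop_with_deadline() {
    # $STOP_CMD is deliberately unquoted so it splits into command + args.
    timeout "$STOP_LIMIT" $STOP_CMD
}

# Run the stop and print a one-line verdict; returns the same status.
report_stop() {
    stop_with_deadline >/dev/null 2>&1
    rc=$?
    case "$rc" in
        0)   echo "stop completed" ;;
        124) echo "stop stalled after ${STOP_LIMIT}s" ;;
        *)   echo "stop failed with status $rc" ;;
    esac
    return "$rc"
}
```

Calling `report_stop` after the failed start prints one of the three verdicts. A status of 124 is the value `timeout` returns when the deadline expires, so a wrapper could use it as the cue to save crm_mon output and "ps axf" before rebooting.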


Don't know if this helps but ... ok I'll try other things ...
Alain


> The bad news - it didn't help, still observing the same issue.
> The good news - it's 100% reproducible.
>
> Vadym
>
> On May 10, 2010, at 7:19 PM, Steven Dake wrote:
>
>> On Mon, 2010-05-10 at 19:02 -0400, Vadym Chepkov wrote:
>>> Yes, I am
>>
>> try without
>>
>>> On May 10, 2010, at 6:59 PM, Steven Dake wrote:
>>>
>>>> Do you have debug: on in your config file?
>>>>
>>>> Regards
>>>> -steve
>>>>
>>>> On Mon, 2010-05-10 at 18:24 -0400, Vadym Chepkov wrote:
>>>>> Hi,
>>>>>
>>>>> I experienced the same issue on Redhat 5.5 PPC.
>>>>> I compiled all packages myself, since there are no ppc packages
>>>>> available in the clusterlabs repository.
>>>>> If Andrew will post his SRPM somewhere or maybe instructions how to
>>>>> compile it, I would be happy to contribute.
>>>>>
>>>>> Vadym
>>>>>
>>>>> On May 10, 2010, at 5:38 PM, Steven Dake wrote:
>>>>>
>>>>>> It seems pretty clear from the mailing list traffic recently there is a
>>>>>> critical flaw with the shutdown related in some way to Pacemaker and
>>>>>> Corosync that happens on a few people's opensuse systems.  It seems to
>>>>>> only reproduce on opensuse however we don't know if it is limited to
>>>>>> this platform.  Finally we want Corosync to work perfectly for every
>>>>>> Linux platform and will do everything possible to understand the
>>>>>> specific environmental issues that are exposing bugs in Corosync.
>>>>>> Unfortunately for several weeks we have been unable in our labs to
>>>>>> reproduce this problem which means we need your help!
>>>>>>
>>>>>> The developers will work to resolve this problem at our highest priority
>>>>>> and release a fix as soon as we can generate an adequate execution
>>>>>> trace.
>>>>>>
>>>>>> We have a backtrace around where the issue occurred which presents us
>>>>>> with enough data to get started.
>>>>>>
>>>>>> Our plans are as follows:
>>>>>> Mon-Wed: Code review of suspected areas and instrumentation patch
>>>>>> created
>>>>>> Thu: Special build created by Andrew with the instrumentation patch for
>>>>>> those people affected by this issue.
>>>>>> We will begin analysis of the instrumentation results once we have a
>>>>>> trace.
>>>>>>
>>>>>> I would really appreciate those people affected by this issue to run
>>>>>> Andrew's special build of Corosync which will have more trace info in it
>>>>>> when it is available.
>>>>>>
>>>>>> Regards
>>>>>> -steve
>>>>>>
>>>>>> On Mon, 2010-05-10 at 14:26 +0200, Alain.Moulle wrote:
>>>>>>> As soon as I get it again ... because it is strange, I have not faced
>>>>>>> the problem again since this morning! And besides, I'm sure that on
>>>>>>> Friday I was in a case where the stop/cleanup (of a resource that
>>>>>>> failed on start) enabled the corosync shutdown to complete, and as
>>>>>>> long as I had not cleaned up the failed resource, the corosync stop
>>>>>>> did not return and was stalled in "Waiting for corosync services to
>>>>>>> unload:........
>>>>>>>
>>>>>>> I'll keep you informed if I can find the conditions for this abnormal
>>>>>>> behavior.
>>>>>>> Thanks
>>>>>>> Regards
>>>>>>> Alain
>>>>>>>
>>>>>>> Andrew Beekhof wrote:
>>>>>>>> On Mon, May 10, 2010 at 8:31 AM, Alain.Moulle <[email protected]> wrote:
>>>>>>>>> I meant  "/etc/init.d/corosync stop" never returns.
>>>>>>>>
>>>>>>>> Ok. Can you show us the logs and "ps axf" please?
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
