Hi Steven,
Vadym, just to know: did you run crm_mon in another window while the
corosync shutdown was stalled, to see whether there were any "failed"
items?
On my side: I have set debug off, and the news (bad or good) is that the
problem did not occur again, but that had also been the case since
yesterday with debug on! With debug off, I tried 10 times without any
problem on corosync shutdown. So I tried again what I thought was a good
clue two days ago:
with debug: off (but it is similar with debug on)
/etc/init.d/corosync stop => successful
mv external/ipmi external/ipmi.save to force the start of my fencing
resource (restofencenode2) to fail
/etc/init.d/corosync start => successful
but crm_mon shows:
restofencenode2 (stonith:external/ipmi): Started node3 FAILED
Failed actions:
restofencenode2_start_0 (node=node3, call=5, rc=1, status=complete): unknown error
then:
/etc/init.d/corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:............................................
.............................................................................................................
and it does not return (it has been about 5 minutes now).
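For anyone who wants to repeat this, the sequence above could be scripted
roughly as follows. This is only a sketch: the plugin directory argument is
a placeholder (the parent directory of external/ipmi depends on the
installation), the helper names run and repro_stalled_stop are made up for
the example, and it assumes the GNU coreutils timeout command is available.
The stonith resource may also need a moment after the start before crm_mon
reports it as failed.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Set DRY_RUN=1 to print the commands
# instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repro_stalled_stop() {
    plugin_dir=$1            # placeholder: parent directory of external/ipmi
    stop_timeout=${2:-300}   # seconds before the stop is considered stalled

    run /etc/init.d/corosync stop || return 1
    # Hide the plugin so that the start of the stonith resource fails.
    run mv "$plugin_dir/external/ipmi" "$plugin_dir/external/ipmi.save" || return 1
    run /etc/init.d/corosync start || return 1
    # One-shot status; the stonith resource is expected to show as FAILED.
    run crm_mon -1 || true
    # timeout(1) exits with status 124 when the time limit is reached.
    status=0
    run timeout "$stop_timeout" /etc/init.d/corosync stop || status=$?
    if [ "$status" -eq 124 ]; then
        echo "corosync stop still running after ${stop_timeout}s" >&2
    fi
    return "$status"
}
```

Calling it with DRY_RUN=1 first, for example
"DRY_RUN=1 repro_stalled_stop /path/to/plugins 300", only prints the
commands, which is a cheap way to check the paths before touching a node.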
So I did:
crm resource cleanup restofencenode2
crm resource stop restofencenode2
but unfortunately this did not help the corosync shutdown to complete,
so I had to reboot the node.
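Before rebooting a node in this state, it may be worth saving what the
cluster and the process tree look like, since the logs and "ps axf" were
asked for earlier in this thread. A minimal sketch, where the function name
capture_stall_state, the default output directory and the file names are
arbitrary choices for the example:

```shell
#!/bin/sh
# Sketch: save cluster and process state while "corosync stop" is stalled.

capture_stall_state() {
    out_dir=${1:-./corosync-stall}
    mkdir -p "$out_dir" || return 1

    # One-shot view of resources and failed actions.
    crm_mon -1 > "$out_dir/crm_mon.txt" 2>&1 \
        || echo "crm_mon -1 failed" >> "$out_dir/crm_mon.txt"

    # Process tree, to see which cluster processes are still alive.
    ps axf > "$out_dir/ps_axf.txt" 2>&1 \
        || echo "ps axf failed" >> "$out_dir/ps_axf.txt"

    echo "state saved in $out_dir"
}
```

Run from a second terminal while the stop is hanging, for example
"capture_stall_state /tmp/stall-node3", it leaves two text files that can
be attached to a reply together with the corosync logs.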
I don't know if this helps, but ... OK, I'll try other things ...
Alain
> The bad news - it didn't help, still observing the same issue.
> The good news - it's 100% reproducible.
>
> Vadym
>
> On May 10, 2010, at 7:19 PM, Steven Dake wrote:
>
>
>> On Mon, 2010-05-10 at 19:02 -0400, Vadym Chepkov wrote:
>>
>>> Yes, I am
>>
>> try without
>>
>>> On May 10, 2010, at 6:59 PM, Steven Dake wrote:
>>>
>>>> Do you have debug: on in your config file?
>>>>
>>>> Regards
>>>> -steve
>>>>
>>>> On Mon, 2010-05-10 at 18:24 -0400, Vadym Chepkov wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I experienced the same issue on Redhat 5.5 PPC.
>>>>> I compiled all packages myself, since there are no ppc packages
>>>>> available in the clusterlabs repository.
>>>>> If Andrew will post his SRPM somewhere or maybe instructions how to
>>>>> compile it, I would be happy to contribute.
>>>>>
>>>>> Vadym
>>>>>
>>>>> On May 10, 2010, at 5:38 PM, Steven Dake wrote:
>>>>>
>>>>>> It seems pretty clear from the mailing list traffic recently there is
>>>>>> a critical flaw with the shutdown related in some way to Pacemaker and
>>>>>> Corosync that happens on a few people's opensuse systems. It seems to
>>>>>> only reproduce on opensuse however we don't know if it is limited to
>>>>>> this platform. Finally we want Corosync to work perfectly for every
>>>>>> Linux platform and will do everything possible to understand the
>>>>>> specific environmental issues that are exposing bugs in Corosync.
>>>>>> Unfortunately for several weeks we have been unable in our labs to
>>>>>> reproduce this problem which means we need your help!
>>>>>>
>>>>>> The developers will work to resolve this problem at our highest
>>>>>> priority and release a fix as soon as we can generate an adequate
>>>>>> execution trace.
>>>>>>
>>>>>> We have a backtrace around where the issue occurred which presents us
>>>>>> with enough data to get started.
>>>>>>
>>>>>> Our plans are as follows:
>>>>>> Mon-Wed: Code review of suspected areas and instrumentation patch
>>>>>> created
>>>>>> Thu: Special build created by Andrew with the instrumentation patch
>>>>>> for those people affected by this issue.
>>>>>> We will begin analysis of the instrumentation results once we have a
>>>>>> trace.
>>>>>>
>>>>>> I would really appreciate those people affected by this issue to run
>>>>>> Andrew's special build of Corosync which will have more trace info in
>>>>>> it when it is available.
>>>>>>
>>>>>> Regards
>>>>>> -steve
>>>>>>
>>>>>> On Mon, 2010-05-10 at 14:26 +0200, Alain.Moulle wrote:
>>>>>>
>>>>>>> As soon as I get it again ... because it is strange, I have not
>>>>>>> faced the problem again since this morning! And besides, I'm sure
>>>>>>> that on Friday I was in a case where the stop/cleanup (of a resource
>>>>>>> that failed on start) enabled the corosync shutdown to complete, and
>>>>>>> as long as I had not cleaned up the failed resource, the corosync
>>>>>>> stop did not return and was stalled in "Waiting for corosync
>>>>>>> services to unload:........"
>>>>>>>
>>>>>>> I'll keep you informed if I can find the conditions for this
>>>>>>> abnormal behavior.
>>>>>>> Thanks
>>>>>>> Regards
>>>>>>> Alain
>>>>>>>
>>>>>>> Andrew Beekhof wrote:
>>>>>>>
>>>>>>>> On Mon, May 10, 2010 at 8:31 AM, Alain.Moulle
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I meant "/etc/init.d/corosync stop" never returns.
>>>>>>>>
>>>>>>>> Ok. Can you show us the logs and "ps axf" please?
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais