Re: [Openais] corosync 1.2.5 still doesn't shutdown properly

Vadym Chepkov Tue, 22 Jun 2010 11:56:51 -0700

On Tue, Jun 22, 2010 at 2:42 PM, Steven Dake <[email protected]> wrote:
> On 06/22/2010 11:31 AM, Vadym Chepkov wrote:
>>
>> On Tue, Jun 22, 2010 at 2:21 PM, Steven Dake<[email protected]>  wrote:
>>>
>>> On 06/22/2010 11:07 AM, Vadym Chepkov wrote:
>>>>
>>>> On Tue, Jun 22, 2010 at 1:49 PM, Steven Dake<[email protected]>    wrote:
>>>>>
>>>>> On 06/22/2010 03:56 AM, Vadym Chepkov wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I decided to check if I can start using corosync again on several of
>>>>>> my clusters (have to use heartbeat there at the moment).
>>>>>> I don't even have any services defined in corosync.conf, commented
>>>>>> pacemaker out, just plain corosync and it never goes down:
>>>>>>
>>>>>> # ps axf|grep corosync
>>>>>> 26294 pts/0    S+     0:00  |               \_ /bin/sh /sbin/service corosync restart
>>>>>> 26299 pts/0    S+     0:01  |                   \_ /bin/bash /etc/init.d/corosync restart
>>>>>> 29249 pts/1    S+     0:00                  \_ grep corosync
>>>>>> 25959 ?        Ssl    0:00 corosync
>>>>>>
>>>>>>
>>>>>> I attached to the process and this is where it hangs:
>>>>>>
>>>>>> (gdb) where
>>>>>> #0  0x0fe14134 in poll () from /lib/libc.so.6
>>>>>> #1  0x0ffbc530 in poll_run (handle=150346236434579456) at coropoll.c:413
>>>>>> #2  0x10006e50 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1576
>>>>>>
>>>>>> How can I help to debug this problem?
>>>>>> It is 100% reproducible.
>>>>>>
>>>>>> Thank you,
>>>>>> Vadym
>>>>>
>>>>> Vadym,
>>>>>
>>>>> Thanks for the feedback.  I do test this scenario and it works for me:
>>>>>
>>>>> [r...@cast flatiron]# service corosync start
>>>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>>>> [r...@cast flatiron]# service corosync restart
>>>>> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
>>>>> Waiting for corosync services to unload:.                  [  OK  ]
>>>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>>>> [r...@cast flatiron]# service corosync stop
>>>>> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
>>>>> Waiting for corosync services to unload:.                  [  OK  ]
>>>>> [r...@cast flatiron]# service corosync start
>>>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>>>> [r...@cast flatiron]# /etc/init.d/corosync restart
>>>>> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
>>>>> Waiting for corosync services to unload:.                  [  OK  ]
>>>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>>>>
>>>>>
>>>>> One thing that would stop corosync from shutting down is if it couldn't
>>>>> enter operational state.  This often happens because of a firewall enabled
>>>>> on the ports corosync uses to communicate.
>>>>>
>>>>> The system logs would be helpful (with debug: on).
>>>>>
>>>>> Regards
>>>>> -steve
>>>>
>>>>
>>>> It works fine on Intel-based servers, but on a Red Hat PPC-based
>>>> server it doesn't.
>>>>
>>>> I attached the config and the log file.
>>>>
>>>> Thanks,
>>>> Vadym
>>>
>>> Nothing jumps out from the logs.  Thanks for the pointer about ppc. I'll
>>> hunt down some PPC hardware and see if I can reproduce/fix.  Could you be
>>> more specific about which ppc (32 or 64) you were running?  Were you
>>> running BE and LE in the same cluster?
>>>
>>> Please be patient, however.  I don't have any ppc hardware personally, and
>>> getting access to non-x86 hardware may take me a few days.
>>
>> That's why I offered to help, since I have access to the PPC and it's
>> in my best interests :)
>>
>> The kernel is ppc64, but most of the utilities are 32-bit; that's how
>> Red Hat ships PPC.
>> I compiled 32-bit corosync, anyway. Both machines have identical
>> kernels, so they can't have different byte order.
>>
>> Thanks,
>> Vadym
>
> Without shell access, it is pretty difficult to know exactly what goes wrong
> on a different byte architecture.
>
> We have spent significant time in the past making corosync work well on
> be/le but occasionally new changes break existing archs.
>


I can't provide you with shell access, unfortunately, but I can give
you any info you might need:

$ setarch ppc gcc -E -dM - < /dev/null |grep ENDIAN
#define __BIG_ENDIAN__ 1
#define _BIG_ENDIAN 1
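
The macros above only describe what the compiler targets. The same facts can
also be read from an installed binary. Below is a minimal sketch, assuming a
POSIX shell and od; elf_class_and_order is a hypothetical helper name, not
something shipped with corosync. It reads the ELF identification bytes, where
byte 4 is the class (32 or 64 bit) and byte 5 is the byte order.

```shell
#!/bin/sh
# Hypothetical helper (not part of corosync): print the ELF class and
# byte order of the file given as the first argument.
elf_class_and_order() {
    # The first four bytes of an ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not an ELF file"
        return 1
    fi

    # e_ident[4] is EI_CLASS (1 = 32-bit, 2 = 64-bit);
    # e_ident[5] is EI_DATA  (1 = little-endian, 2 = big-endian).
    set -- $(od -An -tu1 -j4 -N2 "$1")

    case "$1" in
        1) class="32-bit" ;;
        2) class="64-bit" ;;
        *) class="unknown-class" ;;
    esac
    case "$2" in
        1) order="little-endian" ;;
        2) order="big-endian" ;;
        *) order="unknown-order" ;;
    esac

    echo "$class $order"
}
```

It would be called as elf_class_and_order /usr/sbin/corosync, where the path
is an assumption about where the binary was installed. Running it on both
nodes is a quick way to confirm that they run the same kind of build.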

Thanks,
Vadym
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
