On Tue, Jun 22, 2010 at 2:21 PM, Steven Dake <[email protected]> wrote:
> On 06/22/2010 11:07 AM, Vadym Chepkov wrote:
>>
>> On Tue, Jun 22, 2010 at 1:49 PM, Steven Dake<[email protected]> wrote:
>>>
>>> On 06/22/2010 03:56 AM, Vadym Chepkov wrote:
>>>>
>>>> Hi,
>>>>
>>>> I decided to check if I can start using corosync again on several of
>>>> my clusters (have to use heartbeat there at the moment).
>>>> I don't even have any services defined in corosync.conf, commented
>>>> pacemaker out, just plain corosync and it never goes down:
>>>>
>>>> # ps axf|grep corosync
>>>> 26294 pts/0 S+ 0:00 | \_ /bin/sh /sbin/service corosync restart
>>>> 26299 pts/0 S+ 0:01 | \_ /bin/bash /etc/init.d/corosync restart
>>>> 29249 pts/1 S+ 0:00 \_ grep corosync
>>>> 25959 ? Ssl 0:00 corosync
>>>>
>>>>
>>>> I attached to the process and this is where it hangs:
>>>>
>>>> (gdb) where
>>>> #0 0x0fe14134 in poll () from /lib/libc.so.6
>>>> #1 0x0ffbc530 in poll_run (handle=150346236434579456) at coropoll.c:413
>>>> #2 0x10006e50 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1576
>>>>
>>>> How can I help to debug this problem?
>>>> It is 100% reproducible.
>>>>
>>>> Thank you,
>>>> Vadym
>>>> ________
>>>
>>> Vadym,
>>>
>>> Thanks for the feedback. I do test this scenario and it works for me:
>>>
>>> [r...@cast flatiron]# service corosync start
>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>> [r...@cast flatiron]# service corosync restart
>>> Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
>>> Waiting for corosync services to unload:. [ OK ]
>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>> [r...@cast flatiron]# service corosync stop
>>> Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
>>> Waiting for corosync services to unload:. [ OK ]
>>> [r...@cast flatiron]# service corosync start
>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>> [r...@cast flatiron]# /etc/init.d/corosync restart
>>> Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
>>> Waiting for corosync services to unload:. [ OK ]
>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>>
>>>
>>> One thing that would stop corosync from shutting down is if it couldn't
>>> enter operational state. This often happens because of a firewall enabled
>>> on the ports corosync uses to communicate.
>>>
>>> The system logs would be helpful (with debug: on).
>>>
>>> Regards
>>> -steve
>>
>>
>> And it works fine on Intel based servers, but on Redhat PPC based
>> server it doesn't
>>
>> I attached the config and the log file
>>
>> Thanks,
>> Vadym
>
> Nothing jumps out from the logs. Thanks for the pointer about ppc. I'll
> hunt down some PPC hardware and see if I can reproduce/fix. Could you be
> more specific about which ppc (32 or 64) you were running? Where you
> running BE and LE in same cluster?
>
> Please be patient, however. I don't have any ppc hardware personally, and
> getting access to non-x86 hardware may take me a few days.

That's why I offered to help, since I have access to the PPC and it's in my best interests :)

The kernel is ppc64, but most of the utilities are 32-bit; that's how Red Hat ships PPC. I compiled 32-bit corosync, anyway. Both machines have identical kernels, so they can't have different byte order.

Thanks,
Vadym
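
For what it's worth, a quick way to confirm the word size and byte order that a userland process actually sees on each node is a few lines of Python. This is only a sketch: describe_runtime is a hypothetical helper name, and it reports on the Python interpreter's own build. Treating that as representative of the rest of the userland (including the corosync binary) is an assumption, not a check of corosync itself.

```python
import platform
import struct
import sys


def describe_runtime():
    """Return the architecture, word size and byte order seen by this process.

    Hypothetical helper for illustration; it describes the Python
    interpreter's build, not the corosync binary.
    """
    return {
        # Machine type as reported by the OS (the kernel's view).
        "machine": platform.machine(),
        # Size of a C pointer in this process, in bits (the userland's view).
        "pointer_bits": struct.calcsize("P") * 8,
        # Native byte order of this process: "big" or "little".
        "byte_order": sys.byteorder,
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

Running it on both nodes and comparing the three values would show whether the kernel and userland views differ on a node, and whether the two nodes agree with each other.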
