On 06/22/2010 11:31 AM, Vadym Chepkov wrote:
> On Tue, Jun 22, 2010 at 2:21 PM, Steven Dake<[email protected]> wrote:
>> On 06/22/2010 11:07 AM, Vadym Chepkov wrote:
>>>
>>> On Tue, Jun 22, 2010 at 1:49 PM, Steven Dake<[email protected]> wrote:
>>>>
>>>> On 06/22/2010 03:56 AM, Vadym Chepkov wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I decided to check if I can start using corosync again on several of
>>>>> my clusters (have to use heartbeat there at the moment).
>>>>> I don't even have any services defined in corosync.conf, commented
>>>>> pacemaker out, just plain corosync and it never goes down:
>>>>>
>>>>> # ps axf|grep corosync
>>>>> 26294 pts/0 S+ 0:00 | \_ /bin/sh /sbin/service corosync restart
>>>>> 26299 pts/0 S+ 0:01 | \_ /bin/bash /etc/init.d/corosync restart
>>>>> 29249 pts/1 S+ 0:00 \_ grep corosync
>>>>> 25959 ? Ssl 0:00 corosync
>>>>>
>>>>>
>>>>> I attached to the process and this is where it hangs:
>>>>>
>>>>> (gdb) where
>>>>> #0 0x0fe14134 in poll () from /lib/libc.so.6
>>>>> #1 0x0ffbc530 in poll_run (handle=150346236434579456) at coropoll.c:413
>>>>> #2 0x10006e50 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1576
>>>>>
>>>>> How can I help to debug this problem?
>>>>> It is 100% reproducible.
>>>>>
>>>>> Thank you,
>>>>> Vadym
>>>>
>>>> Vadym,
>>>>
>>>> Thanks for the feedback. I do test this scenario and it works for me:
>>>>
>>>> [r...@cast flatiron]# service corosync start
>>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>>> [r...@cast flatiron]# service corosync restart
>>>> Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
>>>> Waiting for corosync services to unload:. [ OK ]
>>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>>> [r...@cast flatiron]# service corosync stop
>>>> Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
>>>> Waiting for corosync services to unload:. [ OK ]
>>>> [r...@cast flatiron]# service corosync start
>>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>>> [r...@cast flatiron]# /etc/init.d/corosync restart
>>>> Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
>>>> Waiting for corosync services to unload:. [ OK ]
>>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>>>
>>>>
>>>> One thing that would stop corosync from shutting down is if it couldn't
>>>> enter operational state. This often happens because of a firewall
>>>> enabled on the ports corosync uses to communicate.
>>>>
>>>> The system logs would be helpful (with debug: on).
>>>>
>>>> Regards
>>>> -steve
>>>
>>>
>>> And it works fine on Intel based servers, but on Redhat PPC based
>>> server it doesn't
>>>
>>> I attached the config and the log file
>>>
>>> Thanks,
>>> Vadym
>>
>> Nothing jumps out from the logs. Thanks for the pointer about ppc. I'll
>> hunt down some PPC hardware and see if I can reproduce/fix. Could you be
>> more specific about which ppc (32 or 64) you were running? Were you
>> running BE and LE in same cluster?
>>
>> Please be patient, however. I don't have any ppc hardware personally, and
>> getting access to non-x86 hardware may take me a few days.
>
> That's why I offered to help, since I have access to the PPC and it's
> in my best interests :)
>
> The kernel is ppc64, but most of the utilities are 32-bit, that's how
> Redhat ships PPC.
> I compiled 32-bit corosync, anyway. Both machines have identical
> kernel, so they can't have different byte order.
>
> Thanks,
> Vadym

Without shell access, it is pretty difficult to know exactly what goes wrong on an architecture with a different byte order. We have spent significant time in the past making corosync work well on both big-endian and little-endian systems, but occasionally new changes break existing architectures.

Regards
-steve
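For readers following the byte-order discussion, here is a minimal C sketch of the kind of check that can be run on each node to see what a given build actually is (pointer width and host byte order), plus the usual way to keep on-wire integers independent of the host. This is not corosync code: the helper names host_is_big_endian, reverse_bytes32, put_be32, get_be32 and print_build_info are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a big-endian host, 0 on little-endian. */
static inline int host_is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char first;

    memcpy(&first, &probe, 1);      /* lowest-addressed byte of the value */
    return first == 0x01;
}

/* Hypothetical helper: unconditionally reverse the bytes of a 32-bit value. */
static inline uint32_t reverse_bytes32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Store v as big-endian bytes, whatever the host byte order is. */
static inline void put_be32(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read a big-endian 32-bit value back into host representation. */
static inline uint32_t get_be32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Print what this particular binary was built as. */
static inline void print_build_info(void)
{
    assert(sizeof(uint32_t) == 4);
    printf("pointer size: %u bits, byte order: %s\n",
           (unsigned)(sizeof(void *) * 8),
           host_is_big_endian() ? "big-endian" : "little-endian");
}
```

Calling print_build_info() from a small test program built with the same compiler flags as the daemon shows whether that build is 32-bit or 64-bit and which byte order it sees. Shift-based helpers such as put_be32/get_be32 give the same bytes on every host, which is why they are a common way to avoid BE/LE surprises in message formats.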
