Hi,

so in the end I wiped the SLES 11 installation and installed openSUSE 11.1 on my test machines.
Currently the following packages are installed:

- cluster-glue 1.0-12.1
- heartbeat 3.0.0-33.2
- libdlm 2.99.08-31.1
- libdlm2 2.99.08-31.1
- libglue1 1.0-12.1
- libopenais2 0.80.5-15.1
- libpacemaker3 1.0.5-20.1
- lvm2 2.02.39-43.8
- openais 0.80.5-15.1
- pacemaker 1.0.5-20.1
- pacemaker-pygui 1.99.2-5.2
- resource-agents 1.0-31.4

After creating /etc/ha.d/authkeys and /etc/ha.d/ha.cf I tried to start heartbeat. The only difference between the configuration files on the two machines is the IP in the "" entry. But strange things happen! On one machine heartbeat starts and runs as expected, and I can see the current state with crm_mon. On the other machine nothing happens for a few minutes, and then the machine reboots. Here is the relevant part of the logfile:

Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: Version 2 support: yes
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: respawn directive: hacluster /usr/lib/heartbeat/ccm
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: respawn directive: hacluster /usr/lib/heartbeat/cib
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: respawn directive: root /usr/lib/heartbeat/lrmd -r
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: respawn directive: root /usr/lib/heartbeat/stonithd
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: respawn directive: hacluster /usr/lib/heartbeat/attrd
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: respawn directive: hacluster /usr/lib/heartbeat/crmd
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: AUTH: i=1: key = 0x8113f48, auth=0xb7f11050, authname=sha1
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: WARN: Core dumps could be lost if multiple dumps occur.
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: **************************
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: Configuration validated. Starting heartbeat 2.99.4
Sep 17 13:17:38 opensuse-master heartbeat: [4154]: info: Heartbeat Hg Version: node: b37cbb1b036c742f0950977495faca78e68aa53d
Sep 17 13:17:38 opensuse-master heartbeat: [4156]: info: heartbeat: version 2.99.4
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: Heartbeat generation: 1253182976
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: glib: ucast: bound send socket to device: eth1
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: glib: ucast: bound receive socket to device: eth1
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: glib: ucast: started on port 694 interface eth1 to 172.17.26.152
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: Stack hogger failed 0xffffffff
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: Local status now set to: 'up'
Sep 17 13:17:39 opensuse-master heartbeat: [4156]: info: Managed write_hostcachedata process 4161 exited with return code 0.
Sep 17 13:17:40 opensuse-master heartbeat: [4158]: info: Stack hogger failed 0xffffffff
Sep 17 13:17:40 opensuse-master heartbeat: [4159]: info: Stack hogger failed 0xffffffff
Sep 17 13:17:40 opensuse-master heartbeat: [4160]: info: Stack hogger failed 0xffffffff
Sep 17 13:19:39 opensuse-master heartbeat: [4156]: WARN: node opensuse-redundanz: is dead
Sep 17 13:19:39 opensuse-master heartbeat: [4156]: info: Comm_now_up(): updating status to active
Sep 17 13:19:39 opensuse-master heartbeat: [4156]: info: Local status now set to: 'active'
Sep 17 13:19:39 opensuse-master heartbeat: [4156]: info: Starting child client "/usr/lib/heartbeat/ccm" (90,90)
Sep 17 13:19:39 opensuse-master heartbeat: [4156]: info: Starting child client "/usr/lib/heartbeat/cib" (90,90)
Sep 17 13:19:39 opensuse-master heartbeat: [4217]: info: Starting "/usr/lib/heartbeat/ccm" as uid 90 gid 90 (pid 4217)
Sep 17 13:19:39 opensuse-master heartbeat: [4218]: info: Starting "/usr/lib/heartbeat/cib" as uid 90 gid 90 (pid 4218)
Sep 17 13:19:39 opensuse-master heartbeat: [4156]: info: Starting child client "/usr/lib/heartbeat/lrmd -r" (0,0)
Sep 17 13:19:39 opensuse-master heartbeat: [4219]: info: Starting "/usr/lib/heartbeat/lrmd -r" as uid 0 gid 0 (pid 4219)
Sep 17 13:19:39 opensuse-master heartbeat: [4156]: info: Starting child client "/usr/lib/heartbeat/stonithd" (0,0)
Sep 17 13:19:39 opensuse-master heartbeat: [4156]: info: Starting child client "/usr/lib/heartbeat/attrd" (90,90)
Sep 17 13:19:39 opensuse-master heartbeat: [4220]: info: Starting "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 4220)
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: info: Starting child client "/usr/lib/heartbeat/crmd" (90,90)
Sep 17 13:19:40 opensuse-master heartbeat: [4221]: info: Starting "/usr/lib/heartbeat/attrd" as uid 90 gid 90 (pid 4221)
Sep 17 13:19:40 opensuse-master cib: [4218]: info: Invoked: /usr/lib/heartbeat/cib
Sep 17 13:19:40 opensuse-master cib: [4218]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 17 13:19:40 opensuse-master heartbeat: [4222]: info: Starting "/usr/lib/heartbeat/crmd" as uid 90 gid 90 (pid 4222)
Sep 17 13:19:40 opensuse-master cib: [4218]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 17 13:19:40 opensuse-master ccm: [4217]: info: Hostname: opensuse-master
Sep 17 13:19:40 opensuse-master lrmd: [4219]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Sep 17 13:19:40 opensuse-master attrd: [4221]: info: Invoked: /usr/lib/heartbeat/attrd
Sep 17 13:19:40 opensuse-master attrd: [4221]: info: main: Starting up
Sep 17 13:19:40 opensuse-master attrd: [4221]: info: crm_cluster_connect: Unsupported cluster stack: (null)
Sep 17 13:19:40 opensuse-master attrd: [4221]: ERROR: main: HA Signon failed
Sep 17 13:19:40 opensuse-master attrd: [4221]: info: main: Cluster connection active
Sep 17 13:19:40 opensuse-master attrd: [4221]: info: main: Accepting attribute updates
Sep 17 13:19:40 opensuse-master attrd: [4221]: ERROR: main: Aborting startup
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: WARN: Managed /usr/lib/heartbeat/attrd process 4221 exited with return code 100.
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: ERROR: Client /usr/lib/heartbeat/attrd exited with return code 100.
Sep 17 13:19:40 opensuse-master stonithd: [4220]: WARN: Core dumps could be lost if multiple dumps occur.
Sep 17 13:19:40 opensuse-master cib: [4218]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Sep 17 13:19:40 opensuse-master crmd: [4222]: info: Invoked: /usr/lib/heartbeat/crmd
Sep 17 13:19:40 opensuse-master crmd: [4222]: info: main: CRM Hg Version: 13f3497959e894e57b8cb24f59c8683346b216e3
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: WARN: G_CH_dispatch_int: Dispatch function for API client took too long to execute: 180 ms (> 100 ms) (GSource: 0x813f5e0)
Sep 17 13:19:40 opensuse-master stonithd: [4220]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
Sep 17 13:19:40 opensuse-master cib: [4218]: WARN: retrieveCib: Cluster configuration not found: /var/lib/heartbeat/crm/cib.xml
Sep 17 13:19:40 opensuse-master stonithd: [4220]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Sep 17 13:19:40 opensuse-master cib: [4218]: WARN: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
Sep 17 13:19:40 opensuse-master lrmd: [4219]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 17 13:19:40 opensuse-master stonithd: [4220]: info: G_main_add_SignalHandler: Added signal handler for signal 10
Sep 17 13:19:40 opensuse-master stonithd: [4220]: info: G_main_add_SignalHandler: Added signal handler for signal 12
Sep 17 13:19:40 opensuse-master lrmd: [4219]: WARN: Core dumps could be lost if multiple dumps occur.
Sep 17 13:19:40 opensuse-master stonithd: [4220]: info: Stack hogger failed 0xffffffff
Sep 17 13:19:40 opensuse-master cib: [4218]: WARN: readCibXmlFile: Continuing with an empty configuration.
Sep 17 13:19:40 opensuse-master lrmd: [4219]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: WARN: G_CH_dispatch_int: Dispatch function for API client took too long to execute: 120 ms (> 100 ms) (GSource: 0x813f5e0)
Sep 17 13:19:40 opensuse-master stonithd: [4220]: info: crm_cluster_connect: Unsupported cluster stack: (null)
Sep 17 13:19:40 opensuse-master lrmd: [4219]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Sep 17 13:19:40 opensuse-master stonithd: [4220]: ERROR: failed to connect to cluster
Sep 17 13:19:40 opensuse-master lrmd: [4219]: info: G_main_add_SignalHandler: Added signal handler for signal 10
Sep 17 13:19:40 opensuse-master stonithd: [4220]: ERROR: /usr/lib/heartbeat/stonithd abnormally abort.
Sep 17 13:19:40 opensuse-master lrmd: [4219]: info: G_main_add_SignalHandler: Added signal handler for signal 12
Sep 17 13:19:40 opensuse-master lrmd: [4219]: info: Started.
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: WARN: G_CH_dispatch_int: Dispatch function for API client took too long to execute: 260 ms (> 100 ms) (GSource: 0x813f5e0)
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 190 ms (> 100 ms) before being called (GSource: 0x81159b0)
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: info: G_SIG_dispatch: started at 1718188528 should have started at 1718188509
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: WARN: Managed /usr/lib/heartbeat/stonithd process 4220 exited with return code 100.
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: ERROR: Client /usr/lib/heartbeat/stonithd exited with return code 100.
Sep 17 13:19:40 opensuse-master crmd: [4222]: info: crmd_init: Starting crmd
Sep 17 13:19:40 opensuse-master crmd: [4222]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 17 13:19:40 opensuse-master cib: [4218]: info: startCib: CIB Initialization completed successfully
Sep 17 13:19:40 opensuse-master cib: [4218]: info: crm_cluster_connect: Unsupported cluster stack: (null)
Sep 17 13:19:40 opensuse-master cib: [4218]: CRIT: cib_init: Cannot sign in to the cluster... terminating
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: WARN: Managed /usr/lib/heartbeat/cib process 4218 exited with return code 100.
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: ERROR: Client /usr/lib/heartbeat/cib exited with return code 100.
Sep 17 13:19:40 opensuse-master heartbeat: [4156]: EMERG: Rebooting system. Reason: /usr/lib/heartbeat/cib
Sep 17 13:19:40 opensuse-master ccm: [4217]: info: Break tie for 2 nodes cluster
Sep 17 13:19:41 opensuse-master ccm: [4217]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Sep 17 13:19:41 opensuse-master crmd: [4222]: info: do_cib_control: Could not connect to the CIB service: connection failed
Sep 17 13:19:41 opensuse-master crmd: [4222]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Sep 17 13:19:41 opensuse-master crmd: [4222]: info: crmd_init: Starting crmd's mainloop

(If it is useful, I can provide the debug output too, but it is quite a bit longer than the normal output... ;-)

I really don't know what's wrong here. Both machines run openSUSE 11.1 with the same heartbeat components installed. One of them runs fine; the other one dies while heartbeat is starting up. Any ideas what to do next?

Regards,
Yves Schumann

Software Development Engineer
Security Solutions Division
IT Coordinator
______________________________
Ascom (Schweiz) AG
http://www.ascom.com

"Walking on water and developing software from a specification are easy if both are frozen" -- Edward V. Berard
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
