Thanks for trying to help!

I can't provide a crm_report from the failed node at the moment, as I decided to restore the complete node from backup. The versions I use are corosync-1.3.0 and pacemaker-1.0.10. The problem occurred after updating quite a few system packages, but all the cluster-related software was left untouched.

I found exactly the same issue described on the mailing list earlier:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/77881?do=post_view_threaded#77881
At least the symptoms are exactly the same, as are the pasted log files. I also tried enabling debug logging and saw that crm tries to connect to the cib sockets (/var/run/crm_*) too early (IMO) and fails because cib hasn't started yet.

I'm planning to repeat the update of these system packages, but more carefully this time, in order to find out which particular package leads to this behavior.

BTW, how can I create a crm_report? I can't find this binary anywhere on the system. Let me know what kind of input you'll need if I'm able to reproduce the problem.
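
When repeating the update package by package, a small script could help show whether a given step brings the respawn loop back. The following is only a rough sketch in Python: the function name `summarize` and both regular expressions are illustrative assumptions, modelled on the "terminated with signal" and "Respawning failed child process" messages quoted in this thread, and other corosync/pacemaker versions may word these lines differently.

```python
import re
from collections import Counter

# Patterns modelled on the log messages quoted in this thread (assumption:
# each message sits on a single line in the real log file).
TERMINATED = re.compile(
    r"Child\s+process (?P<child>\S+) terminated with signal (?P<signal>\d+)"
)
RESPAWNED = re.compile(r"Respawning\s+failed child process: (?P<child>\S+)")


def summarize(lines):
    """Count terminations per (child, signal) and respawns per child."""
    terminated = Counter()
    respawned = Counter()
    for line in lines:
        match = TERMINATED.search(line)
        if match:
            key = (match.group("child"), int(match.group("signal")))
            terminated[key] += 1
        match = RESPAWNED.search(line)
        if match:
            respawned[match.group("child")] += 1
    return terminated, respawned
```

Passing an open log file (or any iterable of lines) to `summarize` returns two counters, so the totals can be compared before and after each package update.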

Regards,
Vlad.

On Tue, 2012-10-30 at 16:00 +1100, Andrew Beekhof wrote:
> On Sun, Oct 28, 2012 at 9:05 PM, Vladimir Elisseev <vo...@vovan.nl> wrote:
> > Hello,
> >
> > I'm having problem that after reboot one cluster node can't join cluster
> > anymore. Form the log file I can't understand what actually is going on.
> > I only can see, that cib and crm both are respawned frequently. I'd
> > appreciate any help. Below is relevant part of the log file:
>
> I appreciate that you're trying to keep it brief, but problems often
> originate much earlier than people suspect.
> Can you instead attach a crm_report tarball, that will have everything
> (from both nodes) that we need to be able to help.
>
> What version is this btw?
>
> >
> > Oct 28 10:52:22 srv2 cib: [10646]: info: cib_server_process_diff:
> > Requesting re-sync from peer
> > Oct 28 10:52:22 srv2 cib: [10646]: WARN: cib_diff_notify: Local-only Change
> > (client:crmd, call: 4770): -1.-1.-1 (Application of an update diff failed,
> > requesting a full refresh)
> > Oct 28 10:52:22 srv2 cib: [10653]: info: retrieveCib: Reading cluster
> > configuration from: /var/lib/heartbeat/crm/cib.qJTUAV (digest:
> > /var/lib/heartbeat/crm/cib.XwOKXQ)
> > Oct 28 10:52:22 srv2 cib: [10646]: WARN: cib_server_process_diff: Not
> > applying diff 0.1298.5 -> 0.1299.1 (sync in progress)
> > Oct 28 10:52:22 srv2 cib: [10646]: info: cib_replace_notify: Local-only
> > Replace: -1.-1.-1 from srv1
> > Oct 28 10:52:22 corosync [pcmk]: ] info: pcmk_ipc_exit: Client cib
> > (conn=0x1837340, async-conn=0x1837340) left
> > Oct 28 10:52:22 corosync [pcmk]: ] ERROR: pcmk_wait_dispatch: Child
> > process cib terminated with signal 6 (pid=10646, core=true)
> > Oct 28 10:52:22 corosync [pcmk]: ] notice: pcmk_wait_dispatch: Respawning
> > failed child process: cib
> > Oct 28 10:52:22 corosync [pcmk]: ] info: spawn_child: Forked child 10656
> > for process cib
> > Oct 28 10:52:22 srv2 cib: [10656]: info: Invoked: /usr/lib64/heartbeat/cib
> >
> >
> > Regards,
> > Vlad.
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org