I need to see logs from both nodes that relate to the same instance of the issue.
Why are the dates so crazy? One is from a year ago and the other is in the (at the time) future.

> On 2 Dec 2014, at 7:04 pm, Dmitry Matveichev <d.matveic...@mfisoft.ru> wrote:
>
> Hello,
> Any thoughts about this issue? It still affects our cluster.
>
> ------------------------
> Kind regards,
> Dmitriy Matveichev.
>
>
> -----Original Message-----
> From: Dmitry Matveichev
> Sent: Monday, November 17, 2014 12:32 PM
> To: The Pacemaker cluster resource manager
> Subject: RE: [Pacemaker] Long failover
>
> Hello,
>
> Debug logs from the slave are attached. Hope it helps.
>
> ------------------------
> Kind regards,
> Dmitriy Matveichev.
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Monday, November 17, 2014 10:48 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Long failover
>
>
>> On 17 Nov 2014, at 6:17 pm, Andrei Borzenkov <arvidj...@gmail.com> wrote:
>>
>> On Mon, Nov 17, 2014 at 9:34 AM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>
>>>> On 14 Nov 2014, at 10:57 pm, Dmitry Matveichev <d.matveic...@mfisoft.ru> wrote:
>>>>
>>>> Hello,
>>>>
>>>> We have a cluster configured via pacemaker+corosync+crm. The configuration is:
>>>>
>>>> node master
>>>> node slave
>>>> primitive HA-VIP1 IPaddr2 \
>>>>     params ip=192.168.22.71 nic=bond0 \
>>>>     op monitor interval=1s
>>>> primitive HA-variator lsb:variator \
>>>>     op monitor interval=1s \
>>>>     meta migration-threshold=1 failure-timeout=1s
>>>> group HA-Group HA-VIP1 HA-variator
>>>> property cib-bootstrap-options: \
>>>>     dc-version=1.1.10-14.el6-368c726 \
>>>>     cluster-infrastructure="classic openais (with plugin)" \
>>>
>>> General advice: don't use the plugin.
>>> See:
>>>
>>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
>>> http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/
>>>
>>>>     expected-quorum-votes=2 \
>>>>     stonith-enabled=false \
>>>>     no-quorum-policy=ignore \
>>>>     last-lrm-refresh=1383871087
>>>> rsc_defaults rsc-options: \
>>>>     resource-stickiness=100
>>>>
>>>> First I take the variator service down on the master node (actually I delete the service binary and kill the variator process, so the variator fails to restart). Resources move to the slave node very quickly, as expected. Then I restore the binary on the master and restart the variator service. Now I do the same thing to the binary and service on the slave node. The crm status command quickly shows HA-variator (lsb:variator): Stopped, but it takes too much time (for us) before resources are switched back to the master node (around 1 minute).
>>>
>>> I see what you mean:
>>>
>>> 2013-12-21T07:04:12.230827+04:00 master crmd[14267]: notice: te_rsc_command: Initiating action 2: monitor HA-variator_monitor_1000 on slave.mfisoft.ru
>>> 2013-12-21T05:45:09+04:00 slave crmd[7086]: notice: process_lrm_event: slave.mfisoft.ru-HA-variator_monitor_1000:106 [ variator.x is stopped\n ]
>>>
>>> (1 minute goes by)
>>>
>>> 2013-12-21T07:05:14.232029+04:00 master crmd[14267]: error: print_synapse: [Action 2]: In-flight rsc op HA-variator_monitor_1000 on slave.mfisoft.ru (priority: 0, waiting: none)
>>> 2013-12-21T07:05:14.232102+04:00 master crmd[14267]: warning: cib_action_update: rsc_op 2: HA-variator_monitor_1000 on slave.mfisoft.ru timed out
>>>
>>
>> Is it possible that pacemaker is confused by the time difference between master and slave?
>
> Timeouts are all calculated locally, so it shouldn't be an issue (aside from trying to read the logs).
>
>>
>>> Is there a corosync log file configured? That would have more detail on the slave.
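For reference, corosync's log destination is controlled by the logging stanza of corosync.conf; a minimal sketch (illustrative values and paths, exact directives vary by corosync version) that produces the kind of detailed per-node log being asked about:

```
# /etc/corosync/corosync.conf -- logging stanza (illustrative values)
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
```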
>>>
>>>> Then the line
>>>>
>>>> Failed actions:
>>>>     HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013', queued=0ms, exec=0ms
>>>>
>>>> appears in crm status and resources are switched.
>>>>
>>>> What is that timeout? Where can I change it?
>>>>
>>>> ------------------------
>>>> Kind regards,
>>>> Dmitriy Matveichev.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
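The timeout the thread never names is the operation timeout on the monitor op: none is set in the configuration above, so Pacemaker falls back to its cluster-wide default. A sketch of making it explicit in crm shell, either per operation or as a cluster-wide operation default (the 10s values here are illustrative, not a recommendation):

```
# Explicit per-operation timeout on the monitor op (illustrative values):
crm configure primitive HA-variator lsb:variator \
    op monitor interval=1s timeout=10s \
    meta migration-threshold=1 failure-timeout=1s

# Or a cluster-wide default applied to any op without its own timeout:
crm configure op_defaults timeout=10s
```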