Hi! I have a small corosync/pacemaker-based cluster which consists of 4 nodes: 2 nodes are in standby mode, and the other 2 actually handle all the resources.
Corosync ver. 1.4.7-1, Pacemaker ver. 1.1.11, OS: Ubuntu 12.04.

Inside our production environment, which has plenty of free RAM, CPU, etc., everything is working well. When I switch one node off, all the resources move to the other node without any problems, and vice versa. That's what I need :)

Our staging environment has rather weak hardware (that's OK - it's just staging :) ) and is rather busy. Sometimes it doesn't even have enough CPU or disk speed to be stable. When that happens, some of the cluster resources fail (which I consider to be normal), but I also see the following crm output:

Node db-node1: standby
Node db-node2: standby
Online: [ lb-node1 lb-node2 ]

Pgpool2        (ocf::heartbeat:pgpool):        FAILED (unmanaged) [ lb-node2 lb-node1 ]
 Resource Group: IPGroup
     FailoverIP1        (ocf::heartbeat:IPaddr2):        Started [ lb-node2 lb-node1 ]

As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes (lb-node2 and lb-node1), but I can't figure out how that could happen.

This is the output of my crm configure show:

node db-node1 \
        attributes standby=on
node db-node2 \
        attributes standby=on
node lb-node1
node lb-node2
primitive Cachier ocf:site:cachier \
        op monitor interval=10s timeout=30s depth=10 \
        meta target-role=Started
primitive FailoverIP1 IPaddr2 \
        params ip=111.22.33.44 cidr_netmask=32 iflabel=FAILOVER \
        op monitor interval=30s
primitive Mailer ocf:site:mailer \
        meta target-role=Started \
        op monitor interval=10s timeout=30s depth=10
primitive Memcached memcached \
        op monitor interval=10s timeout=30s depth=10 \
        meta target-role=Started
primitive Nginx nginx \
        params status10url="/nginx_status" testclient=curl port=8091 \
        op monitor interval=10s timeout=30s depth=10 \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=60s \
        meta target-role=Started
primitive Pgpool2 pgpool \
        params checkmethod=pid \
        op monitor interval=30s \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=60s
group IPGroup FailoverIP1 \
        meta target-role=Started
colocation ip-with-cachier inf: Cachier IPGroup
colocation ip-with-mailer inf: Mailer IPGroup
colocation ip-with-memcached inf: Memcached IPGroup
colocation ip-with-nginx inf: Nginx IPGroup
colocation ip-with-pgpool inf: Pgpool2 IPGroup
order cachier-after-ip inf: IPGroup Cachier
order mailer-after-ip inf: IPGroup Mailer
order memcached-after-ip inf: IPGroup Memcached
order nginx-after-ip inf: IPGroup Nginx
order pgpool-after-ip inf: IPGroup Pgpool2
property cib-bootstrap-options: \
        expected-quorum-votes=4 \
        stonith-enabled=false \
        default-resource-stickiness=100 \
        maintenance-mode=false \
        dc-version=1.1.10-9d39a6b \
        cluster-infrastructure="classic openais (with plugin)" \
        last-lrm-refresh=1422438144

So the question is: does my config allow a resource like ocf::heartbeat:IPaddr2 to be started on multiple nodes simultaneously? Is that something that can normally happen? Or is it happening because of the shortage of computing power which I described earlier? :)

How can I prevent a thing like this from happening? Is this the kind of case that is normally supposed to be solved by STONITH? (See the P.S. below for what I have in mind.)

Thanks in advance.

--
Best regards,
Sergey Arlashin
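
P.S. To make the STONITH part of the question concrete: if fencing is indeed the answer here, is something along these lines the right direction? This is only a sketch, assuming our nodes have IPMI-capable management boards and that the external/ipmi stonith agent is available on them - the addresses, user and password below are placeholders, not our real values, and the db nodes would get equivalent devices.

primitive st-lb-node1 stonith:external/ipmi \
        params hostname=lb-node1 ipaddr=10.0.0.101 userid=admin passwd=secret interface=lan \
        op monitor interval=60s
primitive st-lb-node2 stonith:external/ipmi \
        params hostname=lb-node2 ipaddr=10.0.0.102 userid=admin passwd=secret interface=lan \
        op monitor interval=60s
# keep each fencing device away from the node it is meant to fence
location st-lb-node1-placement st-lb-node1 -inf: lb-node1
location st-lb-node2-placement st-lb-node2 -inf: lb-node2
property stonith-enabled=true

Does that look like what you would expect for a setup like mine, or is a different agent/approach more appropriate?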