Sorry, I've had to ignore Heartbeat-based clusters for the last few weeks...

There may have been a problem with 1.0.2 (I never tested that release
with Heartbeat), but my testing this week indicates the current code
should work.
So you might want to consider updating...

This looks suspicious though:
  heartbeat[1831]: 2009/03/18_14:18:03 WARN: Message hist queue is filling up (377 messages in queue)
and would seem to indicate some sort of communications problem.
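If you want to keep an eye on that symptom, the queue depth can be pulled straight out of the warning text. A minimal shell sketch (the sample line is hard-coded here; on a live node you would grep /var/log/syslog, or wherever your ha.cf sends logging, instead):

```shell
# Extract the queue depth from a heartbeat hist-queue warning.
# The sample line below stands in for a real syslog entry.
line='heartbeat[1831]: 2009/03/18_14:18:03 WARN: Message hist queue is filling up (377 messages in queue)'
depth=$(printf '%s\n' "$line" | sed -n 's/.*(\([0-9][0-9]*\) messages in queue).*/\1/p')
echo "queue depth: $depth"
```

A steadily growing number here is what suggests the retransmit/communications problem rather than a one-off lost packet.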

I'd suggest grabbing the latest Pacemaker code and submitting a bug if
you find it happens again.
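When filing that bug, the "not in our membership" warnings from the DC are the interesting part; hb_report (shipped alongside recent Pacemaker builds, if you have it) is the easiest way to collect everything, but even a plain grep over the DC's syslog narrows things down. A sketch, with a two-line excerpt standing in for the real log file:

```shell
# Count the membership-related warnings in a log excerpt; on a real
# system you would grep /var/log/syslog (or your ha.cf logfile) instead.
grep -c 'not in our membership' <<'EOF'
crmd[1923]: 2009/03/18_14:13:43 WARN: crmd_ha_msg_callback: Ignoring HA message (op=vote) from lenny1: not in our membership list (size=1)
cib[1919]: 2009/03/18_14:13:46 WARN: cib_peer_callback: Discarding cib_slave_all message (50) from lenny1: not in our membership
EOF
```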

Andrew

On Wed, Mar 18, 2009 at 18:29, Juha Heinanen <[email protected]> wrote:
> i set up the example apache cluster from the document
>
> http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0
>
> but used a mysql server instead of an apache server.  the crm
> configuration of my test cluster looks like this:
>
> node $id="8df8447f-6ecf-41a7-a131-c89fd59a120d" lenny1
> node $id="f13aff7b-6c94-43ac-9a24-b118e62d5325" lenny2
> primitive drbd0 ocf:heartbeat:drbd \
>        params drbd_resource="drbd0" \
>        op monitor interval="59s" role="Master" timeout="30s" \
>        op monitor interval="60s" role="Slave" timeout="30s"
> primitive fs0 ocf:heartbeat:Filesystem \
>        params ftype="ext3" directory="/var/lib/mysql" device="/dev/drbd0" \
>        meta target-role="Started"
> primitive mysql-server lsb:mysql \
>        op monitor interval="10s" timeout="30s" start-delay="10s"
> primitive virtual-ip ocf:heartbeat:IPaddr2 \
>        params ip="192.98.102.10" broadcast="192.98.102.255" nic="eth1" cidr_netmask="24" \
>        op monitor interval="21s" timeout="5s"
> group mysql-group fs0 mysql-server virtual-ip
> ms ms-drbd0 drbd0 \
>        meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> colocation mysql-group-on-ms-drbd0 inf: mysql-group ms-drbd0:Master
> order ms-drbd0-before-mysql-group inf: ms-drbd0:promote mysql-group:start
> property $id="cib-bootstrap-options" \
>        dc-version="1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160" \
>        default-resource-stickiness="1"
>
> initially both nodes were online, lenny2 being the master.  then i tried
> to see what happens when i reboot lenny1.  when lenny1 was powered off,
> the cluster correctly looked like this:
>
> # crm_mon -1
>
> ============
> Last updated: Wed Mar 18 14:12:09 2009
> Current DC: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325)
> Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
> 2 Nodes configured.
> 2 Resources configured.
> ============
>
> Node: lenny1 (8df8447f-6ecf-41a7-a131-c89fd59a120d): OFFLINE
> Node: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325): online
>
> Master/Slave Set: ms-drbd0
>    drbd0:0     (ocf::heartbeat:drbd):  Stopped
>    drbd0:1     (ocf::heartbeat:drbd):  Master lenny2
> Resource Group: mysql-group
>    fs0 (ocf::heartbeat:Filesystem):    Started lenny2
>    mysql-server        (lsb:mysql):    Started lenny2
>    virtual-ip  (ocf::heartbeat:IPaddr2):       Started lenny2
>
> when i powered lenny1 on again, i expected it to rejoin the cluster
> after it came back online, but it was totally ignored.
>
> the log is below.  the software versions are heartbeat 2.99.2 and
> pacemaker 1.0.2.
>
> any clues why lenny1 was ignored and my very first test to achieve high
> availability with heartbeat/pacemaker failed?  people on the pacemaker
> list suspected ccm, which is part of heartbeat.
>
> -- juha
>
> ------------------------------------------
>
> this appeared in syslog when lenny1 was powered off:
>
> r...@lenny2:~# heartbeat[1831]: 2009/03/18_14:12:32 WARN: node lenny1: is dead
> heartbeat[1831]: 2009/03/18_14:12:32 info: Link lenny1:eth1 dead.
> crmd[1923]: 2009/03/18_14:12:32 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [dead] (DC=true)
> crmd[1923]: 2009/03/18_14:12:32 info: crm_update_peer_proc: lenny1.ais is now offline
> crmd[1923]: 2009/03/18_14:12:32 info: te_graph_trigger: Transition 12 is now complete
> crmd[1923]: 2009/03/18_14:12:32 info: notify_crmd: Transition 12 status: done - <null>
>
> and this when it was powered on again:
>
> heartbeat[1831]: 2009/03/18_14:12:56 info: Heartbeat restart on node lenny1
> heartbeat[1831]: 2009/03/18_14:12:56 info: Link lenny1:eth1 up.
> heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status init
> heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status up
> crmd[1923]: 2009/03/18_14:12:56 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [init] (DC=true)
> crmd[1923]: 2009/03/18_14:12:56 info: crm_update_peer_proc: lenny1.ais is now online
> crmd[1923]: 2009/03/18_14:12:56 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [up] (DC=true)
> heartbeat[1831]: 2009/03/18_14:13:26 info: Status update for node lenny1: status active
> crmd[1923]: 2009/03/18_14:13:26 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [active] (DC=true)
> cib[1919]: 2009/03/18_14:13:26 info: cib_client_status_callback: Status update: Client lenny1/cib now has status [join]
> cib[1919]: 2009/03/18_14:13:26 info: crm_update_peer_proc: lenny1.cib is now online
> heartbeat[1831]: 2009/03/18_14:13:30 WARN: 1 lost packet(s) for [lenny1] [55:57]
> heartbeat[1831]: 2009/03/18_14:13:30 info: No pkts missing from lenny1!
> crmd[1923]: 2009/03/18_14:13:30 notice: crmd_client_status_callback: Status update: Client lenny1/crmd now has status [online] (DC=true)
> crmd[1923]: 2009/03/18_14:13:30 info: crm_update_peer_proc: lenny1.crmd is now online
> heartbeat[1831]: 2009/03/18_14:13:31 WARN: 1 lost packet(s) for [lenny1] [59:61]
> heartbeat[1831]: 2009/03/18_14:13:31 info: No pkts missing from lenny1!
> crmd[1923]: 2009/03/18_14:13:33 WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce) from lenny1: not in our membership list (size=1)
> crmd[1923]: 2009/03/18_14:13:43 WARN: crmd_ha_msg_callback: Ignoring HA message (op=vote) from lenny1: not in our membership list (size=1)
> cib[1919]: 2009/03/18_14:13:46 WARN: cib_peer_callback: Discarding cib_slave_all message (50) from lenny1: not in our membership
> cib[1919]: 2009/03/18_14:13:47 WARN: cib_peer_callback: Discarding cib_replace message (54) from lenny1: not in our membership
> cib[1919]: 2009/03/18_14:13:48 WARN: cib_peer_callback: Discarding cib_apply_diff message (58) from lenny1: not in our membership
> cib[1919]: 2009/03/18_14:13:50 WARN: cib_peer_callback: Discarding cib_apply_diff message (5c) from lenny1: not in our membership
> cib[1919]: 2009/03/18_14:13:51 WARN: cib_peer_callback: Discarding cib_apply_diff message (5e) from lenny1: not in our membership
> cib[1919]: 2009/03/18_14:13:51 WARN: cib_peer_callback: Discarding cib_apply_diff message (5f) from lenny1: not in our membership
> cib[1919]: 2009/03/18_14:13:51 WARN: cib_peer_callback: Discarding cib_apply_diff message (60) from lenny1: not in our membership
> cib[1919]: 2009/03/18_14:13:51 WARN: cib_peer_callback: Discarding cib_apply_diff message (61) from lenny1: not in our membership
> cib[1919]: 2009/03/18_14:13:51 WARN: cib_peer_callback: Discarding cib_apply_diff message (62) from lenny1: not in our membership
> heartbeat[1831]: 2009/03/18_14:16:01 info: all clients are now paused
> cib[1919]: 2009/03/18_14:16:27 info: cib_stats: Processed 32 operations (19062.00us average, 0% utilization) in the last 10min
> heartbeat[1831]: 2009/03/18_14:18:02 WARN: Message hist queue is filling up (376 messages in queue)
> heartbeat[1831]: 2009/03/18_14:18:03 WARN: Message hist queue is filling up (377 messages in queue)
> ...
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>