Hi,
We have a MySQL cluster which works fine when I have a single master ("A") and
a single slave ("B"). Failover is almost immediate and I am happy with this
approach.
When we configured two additional slaves, strange things started to happen.
From time to time I notice that all slave MySQL instances are restarted, and I
cannot figure out why.
I tried to find out what is happening, and this is how far I got:
There is a repeating sequence in the DC, which looks like this when everything
is fine:
Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:45:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71358
(ref=pe_calc-dc-1378777542-165977) derived from
/var/lib/pengine/pe-input-3179.bz2
Sep 10 01:45:42 oamgr crmd: [3385]: notice: run_graph: ==== Transition 71358
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-3179.bz2): Complete
Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 10 01:47:42 oamgr crmd: [3385]: info: crm_timer_popped: PEngine Recheck
Timer (I_PE_CALC) just popped (120000ms)
Sep 10 01:47:42 oamgr crmd: [3385]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Sep 10 01:47:42 oamgr crmd: [3385]: info: do_state_transition: Progressed to
state S_POLICY_ENGINE after C_TIMER_POPPED
....
But it looks somewhat different when I see the restarts:
....
Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:51:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71361
(ref=pe_calc-dc-1378777902-165980) derived from
/var/lib/pengine/pe-input-3179.bz2
Sep 10 01:51:42 oamgr crmd: [3385]: notice: run_graph: ==== Transition 71361
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-3179.bz2): Complete
Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph:
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair,
id=status-oadb2-master-db-mysql.1, name=master-db-mysql:1, value=0, magic=NA,
cib=0.4829.3480) : Transient attribute: update
Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph:
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair,
id=status-oadb2-readable, name=readable, value=0, magic=NA, cib=0.4829.3481) :
Transient attribute: update
.....
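To make the abort lines above easier to compare across occurrences, here is a
minimal Python sketch that pulls the key=value fields out of an
abort_transition_graph line. It assumes each syslog entry has been joined back
onto a single line (the excerpts in this mail are wrapped), and the names
ABORT_RE and parse_abort are made up for this illustration; they are not part
of Pacemaker.

```python
import re

# Captures the parenthesised key=value list and the trailing reason text.
ABORT_RE = re.compile(
    r"Triggered transition abort \((?P<fields>[^)]*)\)\s*:\s*(?P<reason>.*)"
)


def parse_abort(line):
    """Return (fields, reason) for an abort line, or None if it does not match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    fields = {}
    for pair in m.group("fields").split(","):
        key, _, value = pair.strip().partition("=")
        fields[key] = value
    return fields, m.group("reason").strip()


# First abort line from the excerpt above, joined onto one line.
sample = (
    "Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: "
    "te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, "
    "id=status-oadb2-master-db-mysql.1, name=master-db-mysql:1, value=0, "
    "magic=NA, cib=0.4829.3480) : Transient attribute: update"
)

fields, reason = parse_abort(sample)
print(fields["name"], fields["value"], reason)
```

Run over the whole log, this should show which transient attribute (here
master-db-mysql:1 and readable on oadb2, both set to 0) changes right before
each restart.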
There is a transition abort, and shortly after this the slaves are restarted:
....
Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Move db-mysql:1
(Slave oadb2 -> huoadb1)
Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Move db-mysql:2
(Slave huoadb1 -> oadb2)
Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71362
(ref=pe_calc-dc-1378777965-165981) derived from
/var/lib/pengine/pe-input-3180.bz2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action
148: notify db-mysql:0_pre_notify_stop_0 on oadb1
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action
150: notify db-mysql:1_pre_notify_stop_0 on oadb2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action
151: notify db-mysql:2_pre_notify_stop_0 on huoadb1
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action
152: notify db-mysql:3_pre_notify_stop_0 on huoadb2
Sep 10 01:52:45 oamgr pengine: [3384]: notice: process_pe_message: Transition
71362: PEngine Input stored in: /var/lib/pengine/pe-input-3180.bz2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 39:
stop db-mysql:1_stop_0 on oadb2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 43:
stop db-mysql:2_stop_0 on huoadb1
....
It appears that oadb2 and huoadb1 swap places with each other (db-mysql:1 moves
from oadb2 to huoadb1, and db-mysql:2 moves from huoadb1 to oadb2). Does that
make any sense?
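To check whether every restart shows the same pattern, here is a minimal
Python sketch that scans pengine LogActions lines for "Move" entries and
reports pairs of clone instances that simply trade nodes. It assumes one
syslog entry per line and that all lines passed in belong to the same
transition; MOVE_RE, find_moves and find_swaps are hypothetical names used
only for this example.

```python
import re

# Matches e.g. "LogActions: Move db-mysql:1 (Slave oadb2 -> huoadb1)".
MOVE_RE = re.compile(
    r"LogActions: Move\s+(?P<rsc>\S+)\s+"
    r"\((?P<role>\w+)\s+(?P<src>\S+) -> (?P<dst>\S+)\)"
)


def find_moves(lines):
    """Return a list of (resource, source_node, target_node) tuples."""
    moves = []
    for line in lines:
        m = MOVE_RE.search(line)
        if m:
            moves.append((m.group("rsc"), m.group("src"), m.group("dst")))
    return moves


def find_swaps(moves):
    """Return pairs of resources whose moves are exact mirror images."""
    swaps = []
    for i, (rsc_a, src_a, dst_a) in enumerate(moves):
        for rsc_b, src_b, dst_b in moves[i + 1:]:
            if src_a == dst_b and dst_a == src_b:
                swaps.append((rsc_a, rsc_b))
    return swaps


# The two Move lines from the excerpt above, joined onto one line each.
log_lines = [
    "Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: "
    "Move db-mysql:1 (Slave oadb2 -> huoadb1)",
    "Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: "
    "Move db-mysql:2 (Slave huoadb1 -> oadb2)",
]

moves = find_moves(log_lines)
swaps = find_swaps(moves)
print(swaps)
```

For the excerpt above this reports db-mysql:1 and db-mysql:2 as a swapped
pair, which is the behaviour I am asking about.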
It happens only when all 4 MySQL nodes (oadb1, oadb2, huoadb1, huoadb2) are
online. When I put oadb2 in standby for a day, I did not see any restarts.
Could someone help me troubleshoot this?
MySQL version is 5.1.66
Pacemaker 1.1.7
Corosync 1.4.2
MySQL RA is the latest from GitHub
Thanks in advance,
Attila