hi,

disclaimer: it has been some time since i thoroughly studied
the inner workings of mysql replication, but i think that my
current knowledge is sufficient to diagnose this problem.
perhaps i'm just lacking the skills to find the best solution ;)


PROBLEM DESCRIPTION:

i use the mysql ra rev 5ffc6ee2fae98adae8f2 from git [2]
i have my mysql m/s up and running:
> root@wdb01 ~ # crm status
...
> Online: [ wdb01 wdb02 ]
...
>  Master/Slave Set: ms-wdb-mysql
>      Masters: [ wdb01 ]
>      Slaves: [ wdb02 ]

my databases are in sync and i truncate the test table:
> # ssh wdb01c "mysql nagiostest -Be 'truncate table test_table'"
> # ssh wdb01c "mysql nagiostest -Be 'select * from test_table'"
> # ssh wdb02c "mysql nagiostest -Be 'select * from test_table'"

inserting some values and querying both nodes to verify that
everything is in sync:
> # ssh wdb01c "mysql nagiostest -Be 'insert into test_table values (1),(2)'"
> # ssh wdb01c "mysql nagiostest -Be 'select * from test_table'"
> id
> 1
> 2
> # ssh wdb02c "mysql nagiostest -Be 'select * from test_table'"
> id
> 1
> 2

great, now i do a migration that affects ms-wdb-mysql:
> # crm resource migrate wdb-ip0 PT1M
> Migration will take effect until: 2011-03-11 17:21:50Z
> WARNING: Creating rsc_location constraint 'cli-standby-wdb-ip0' with a score 
> of -INFINITY for resource wdb-ip0 on wdb01.
>         This will prevent wdb-ip0 from running on wdb01 until the constraint 
> is removed using the 'crm_resource -U' command or manually with cibadmin
>         This will be the case even if wdb01 is the last node in the cluster
>         This message can be disabled with -Q

checking the status:
> # crm status
...
> Online: [ wdb01 wdb02 ]
...
>  Master/Slave Set: ms-wdb-mysql
>      Masters: [ wdb02 ]
>      Slaves: [ wdb01 ]

checking the data:
> # ssh wdb01c "mysql nagiostest -e 'select * from test_table'"
> id
> 1
> 2
> # ssh wdb02c "mysql nagiostest -e 'select * from test_table'"
> id
> 1
> 2

-> *everything is fine*
-> insert more values on the new master wdb02:
> # ssh wdb02c "mysql nagiostest -Be 'insert into test_table values 
> (3),(4),(5)'"
> # ssh wdb02c "mysql nagiostest -e 'select * from test_table'"
> id
> 1
> 2
> 3
> 4
> 5
> root@wdb01 ~ # ssh wdb01c "mysql nagiostest -e 'select * from test_table'"
> id
> 1
> 2
> 3
> 4
> 5

-> great!
-> another migration:
> root@wdb01 ~ # crm resource migrate wdb-ip0 PT1M
> Migration will take effect until: 2011-03-11 17:27:13Z
> WARNING: Creating rsc_location constraint 'cli-standby-wdb-ip0' with a score 
> of -INFINITY for resource wdb-ip0 on wdb02.
>         This will prevent wdb-ip0 from running on wdb02 until the constraint 
> is removed using the 'crm_resource -U' command or manually with cibadmin
>         This will be the case even if wdb02 is the last node in the cluster
>         This message can be disabled with -Q

check status:
> # crm status
...
> Online: [ wdb01 wdb02 ]
...
>      Masters: [ wdb01 ]
>      Slaves: [ wdb02 ]

-> fine
-> insert another value on wdb01 and query:
> # ssh wdb01c "mysql nagiostest -Be 'insert into test_table values (6)'"
> # ssh wdb01c "mysql nagiostest -e 'select * from test_table'"
> id
> 1
> 2
> 3
> 4
> 5
> 6
-> fine, now query wdb02:

> root@wdb01 ~ # ssh wdb02c "mysql nagiostest -e 'select * from test_table'"
> id
> 1
> 2
> 6

*error*

things get worse with each further failover :)


MY CONCLUSION:
after reading the source and some thinking, i came to the following conclusion:

1. when a new master is promoted, its slave replication logs (relay logs)
do not get purged.
this is no problem in the case of *one* failover, because

a. the slave will (hopefully) be consistent,
b. gets promoted to master, and
c. the old master (which has no slave replication logs yet) will happily start
the replication with an empty relay log.
(a quick way to inspect the leftover slave state on a node is shown below)
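
(side note: to see that this leftover slave state really sticks around, one
can check SHOW SLAVE STATUS by hand on the node in question; the field names
are from stock mysql 5.1, the egrep is only there for readability)

  # run manually on the node that still carries old slave state:
  mysql -e 'SHOW SLAVE STATUS\G' | \
      egrep 'Master_Log_File|Relay_Log_File|Relay_Log_Pos|Exec_Master_Log_Pos'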


*but*
2. when the roles switch,
a. the consistent slave is promoted, and
b. the now new slave gets reconfigured via set_master() from the ra [2],
which issues a bare CHANGE MASTER TO [1]
->
>     ocf_run $MYSQL $MYSQL_OPTIONS_LOCAL $MYSQL_OPTIONS_REPL \
>         -e "CHANGE MASTER TO MASTER_HOST='$master_host', \
>                              MASTER_USER='$OCF_RESKEY_replication_user', \
>                              MASTER_PASSWORD='$OCF_RESKEY_replication_passwd'"

c. and will then start the replication with the default RELAY_LOG_FILE and
RELAY_LOG_POS=0, as well as the MASTER_LOG_FILE and MASTER_LOG_POS=0
information from master.info/relay-log.info, leading to:

d. truncate table, insert 1,2 and finally insert 6
(see the sketch right after this list for what explicit coordinates would look like)
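
(for comparison, this is roughly what a CHANGE MASTER TO with explicit
coordinates would look like; the file name, position and credentials below
are made-up example values and would really have to come from
SHOW MASTER STATUS on the freshly promoted node)

  -- on the node being promoted:
  SHOW MASTER STATUS;
  --   e.g. File: mysql-bin.000001, Position: 106

  -- on the node being demoted, using the values from above:
  CHANGE MASTER TO
      MASTER_HOST='wdb02',
      MASTER_USER='repl',
      MASTER_PASSWORD='secret',
      MASTER_LOG_FILE='mysql-bin.000001',
      MASTER_LOG_POS=106;
  START SLAVE;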



SOLVING THIS ISSUE?
i thought about how to solve this problem and came up with the following
two ideas (both roughly sketched below):

1. besides using reset master upon failover, also use reset slave to
purge any existing relay logs upon changing the master.

2. maybe it is possible to save and communicate the correct
RELAY_LOG_FILE, RELAY_LOG_POS and/or MASTER_LOG_FILE, MASTER_LOG_POS
information in the cib?
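
to make this a bit more concrete, here is a rough and completely untested
sketch: idea 1 inside set_master() (variable names taken from the ra [2]),
and idea 2 using crm_attribute with a made-up attribute name mysql_master_log:

  set_master() {
      # idea 1: purge stale relay logs and master.info before
      # pointing this node at its new master
      ocf_run $MYSQL $MYSQL_OPTIONS_LOCAL $MYSQL_OPTIONS_REPL \
          -e "STOP SLAVE; RESET SLAVE;"

      ocf_run $MYSQL $MYSQL_OPTIONS_LOCAL $MYSQL_OPTIONS_REPL \
          -e "CHANGE MASTER TO MASTER_HOST='$master_host', \
                               MASTER_USER='$OCF_RESKEY_replication_user', \
                               MASTER_PASSWORD='$OCF_RESKEY_replication_passwd'"
  }

  # idea 2: the node being promoted publishes its binlog coordinates
  # as a transient node attribute ...
  coords=$($MYSQL $MYSQL_OPTIONS_LOCAL -NBe 'SHOW MASTER STATUS' | \
      awk '{print $1 "|" $2}')
  crm_attribute --node $(uname -n) --lifetime reboot \
      --name mysql_master_log --update "$coords"

  # ... and the node being demoted reads them back and could append
  # MASTER_LOG_FILE/MASTER_LOG_POS to the CHANGE MASTER TO above
  coords=$(crm_attribute --node $master_host --lifetime reboot \
      --name mysql_master_log --query --quiet)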


your feedback is appreciated :)

thanks,
raoul

[1] http://dev.mysql.com/doc/refman/5.1/en/change-master-to.html
[2] https://github.com/ClusterLabs/resource-agents/blob/5ffc6ee2fae98adae8f2927f87a9afdc7b8f1b6d/heartbeat/mysql