Hi Brad, do you run wsrep_desync=ON on the node before running the backup? It seems like a case of flow control triggering.
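Something along these lines — just a sketch, with credentials, paths, and options as placeholders rather than your actual setup:

```shell
# Check whether flow control has been kicking in; the wsrep_flow_control_*
# counters are cumulative since the last FLUSH STATUS.
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%'"

# A desync-wrapped backup would look roughly like this
# (BACKUP_DIR is a placeholder, not your actual script):
BACKUP_DIR=/backups/$(date +%F)

mysql -e "SET GLOBAL wsrep_desync=ON"   # node stops emitting flow control
xtrabackup --backup --galera-info --target-dir="$BACKUP_DIR"
mysql -e "SET GLOBAL wsrep_desync=OFF"  # catch up and rejoin flow control
```

If wsrep_flow_control_paused is climbing during the backup window even with the node desynced, that would point away from flow control as the cause.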
On Fri, Dec 11, 2015 at 1:29 AM Brad Jorgensen <[email protected]> wrote:
> We have a three-node (db1, db2, db3) Galera cluster with MariaDB 10.0.22
> on CentOS 6.7. A couple of days ago I upgraded to 10.1.9. Xtrabackup
> (2.3.2) is run every night on each node at 1am, 2am, and 3am
> respectively. Before the backup starts, the node is desynced.
>
> The first night after upgrading to 10.1.9, the problem began. All
> connections were going to db1 until the backup started, when db1 was
> removed from the routing pool and new connections began going to db2.
> At that time there was little traffic aside from the backup; much of it
> was probably monitoring queries. Our monitoring shows that running
> threads went from about 2 just before the backup finished around 1:32am
> to about 150 just after. At the same time, the running threads on db2
> went from 1 to 10. After the backup completed, all new connections were
> going to db1 again. The running threads on db1 continued to slowly grow
> until the stuck queries took up all of the server processes on our
> application servers and we were alerted around 3:50am. I checked the
> process list and almost all of the queries were in the "query end"
> state; I think they were all write queries. I tried to kill most of
> them but they just stayed in the same state. I restarted db2 to try to
> kick the cluster without losing data. I had to force the shutdown since
> three threads never ended after about 10 minutes of waiting. The
> running threads on db1 returned to normal. db2 had to do a full SST,
> which took until 6:05 to complete. At that time, the running processes
> on db1 began to increase again. When db2 was back up I downgraded it to
> 10.0.22 and rejoined it to the cluster. I tried to restart db1, but it
> needed a full SST so I left it down. A bit later I took down db3 to
> downgrade it too, and that went fine. The cluster was fine through the
> day during normal business operation.
>
> The next night only db2 and db3 were up, both running 10.0.22. What
> appears to be the same problem started at 3:31am, when xtrabackup paused
> Galera ("Provider paused at
> 8c53b634-9514-11e4-b8bd-dab05673fb36:875650526") on db3 for the backup.
> At that time the running threads on db2 shot up and slowly increased
> until I shut it down at 6:28. I had to kill it again because three
> threads never ended. db3 showed nothing unusual in the logs. I got the
> InnoDB engine status from db2 three times, a few minutes apart, before
> I restarted; they are attached.
>
> Additionally, I attached an excerpt from the logs on db2 and db3 during
> the second incident and the my.cnf from one of the servers; it's
> basically the same for the others. I'm working on getting a clean set
> of logs from the first incident, but from what I initially saw, they
> are basically the same as the second set. I'm ready if the problem
> arises again and I'll try to get more information, including SHOW
> GLOBAL STATUS.
>
> Our environment hadn't changed for at least a month and the issue first
> appeared after upgrading to 10.1.9, but since it didn't go away after
> downgrading, I'm not sure where the issue is.
>
> I found a few mentions of what might be the same problem:
> http://marialog.archivist.info/2015-04-03.txt
> https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1149755

--
Guillaume Lefranc
Remote DBA Services Manager
MariaDB Corporation
_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp

