Hi,

On Mon, Feb 11, 2013 at 12:21 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Mon, Feb 11, 2013 at 12:41 PM, Carlos Xavier
> <cbas...@connection.com.br> wrote:
>> Hi Andrew,
>>
>> thank you very much for your hints.
>>
>>> > Hi.
>>> >
>>> > We are running two clusters composed of two machines each. We are
>>> > using DRBD + OCFS2 to provide the common filesystem.
>>
>> [snip]
>>
>>> > The clusters run fine under normal load, except when backing up files
>>> > or optimizing the databases. At that time we get a huge increase in
>>> > data coming from the mysqldump to the backup resource, or from the
>>> > resource mounted on /export.
>>> > Sometimes when performing the backup or optimizing the database (done
>>> > just on the mysql cluster), Pacemaker declares a node dead (but it's
>>> > not)
>>>
>>> Well, you know that, but it doesn't :)
>>> It just knows it can't talk to its peer anymore - which it has to treat
>>> as a failure.
>>>
>>> > and starts the recovery process. When that happens we end up with both
>>> > machines getting restarted, and most of the time with a crashed
>>> > database :-(
>>> >
>>> > As you can see below, just about 30 seconds after the dump starts on
>>> > diana the problem happens.
>>> > ----------------------------------------------------------------
>>
>> [snip]
>>
>>> > Feb  6 04:27:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:28:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:29:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:30:01 diana /USR/SBIN/CRON[1257]: (root) CMD (/root/scripts/bkp_database_diario.sh)
>>> > Feb  6 04:30:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:31:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:31:42 diana lrmd: [2919]: WARN: ip_intranet:0:monitor process (PID 1902) timed out (try 1). Killing with signal SIGTERM (15).
>>>
>>> I'd increase the timeout here. Or put Pacemaker into maintenance mode
>>> (where it will not act on failures) while you do the backups - but
>>> that's more dangerous.
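
A rough, untested sketch of both options with crmsh (ip_intranet and the
backup script are from your logs; the timeout value is just an example,
size it to your backup window):

  # raise the timeout on the monitor op that timed out, e.g.
  crm configure edit ip_intranet
  #   op monitor interval="30s" timeout="120s"

  # or freeze resource management for the whole backup run
  crm configure property maintenance-mode=true
  /root/scripts/bkp_database_diario.sh
  crm configure property maintenance-mode=false

Keep in mind that while maintenance-mode is true nothing is monitored or
recovered, so failures during the backup go unnoticed until you turn it
back off.
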
>>> > Feb  6 04:31:47 diana corosync[2902]: [CLM   ] CLM CONFIGURATION CHANGE
>>> > Feb  6 04:31:47 diana corosync[2902]: [CLM   ] New Configuration:
>>> > Feb  6 04:31:47 diana corosync[2902]: [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
>>> > Feb  6 04:31:47 diana corosync[2902]: [CLM   ] Members Left:
>>> > Feb  6 04:31:47 diana corosync[2902]: [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
>>> > Feb  6 04:31:47 diana corosync[2902]: [CLM   ] Members Joined:
>>>
>>> This appears to be the (almost) root of your problem.
>>> The load is starving corosync of CPU (or possibly network bandwidth)
>>> and it can no longer talk to its peer.
>>> Corosync then informs Pacemaker, which initiates recovery.
>>>
>>> I'd start by tuning some of your timeout values in corosync.conf
>>
>> It should be the CPU, because I can see it going to 100% usage on the
>> cacti graph.
>> Also, we have two rings for corosync: one affected by the data flow at
>> backup time, and another with free bandwidth.
>>
>> This is the totem section of my configuration:
>>
>> totem {
>>         version: 2
>>         token: 5000
>>         token_retransmits_before_loss_const: 10
>>         join: 60
>>         consensus: 6000
>>         vsftype: none
>>         max_messages: 20
>>         clear_node_high_bit: yes
>>         secauth: off
>>         threads: 0
>>         rrp_mode: active
>>         interface {
>>                 ringnumber: 0
>>                 bindnetaddr: 10.10.1.0
>>                 mcastaddr: 226.94.1.1
>>                 mcastport: 5406
>>                 ttl: 1
>>         }
>>         interface {
>>                 ringnumber: 1
>>                 bindnetaddr: 10.10.10.0
>>                 mcastaddr: 226.94.1.1
>>                 mcastport: 5406
>>                 ttl: 1
>>         }
>> }
>>
>> Can you kindly point out which timers/counters I should play with?
>
> I would start by making these higher, perhaps double them, and see what
> effect it has.
>
> token: 5000
> token_retransmits_before_loss_const: 10
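
To make that concrete, doubling both would look something like this (my
sketch, untested on your setup - and note that corosync.conf(5) requires
consensus to be at least 1.2 * token, so it has to be raised along with
the token timeout):

  totem {
          token: 10000                             # was 5000
          token_retransmits_before_loss_const: 20  # was 10
          consensus: 12000                         # must be >= 1.2 * token
          # ... rest of the section unchanged ...
  }

The trade-off is that with a 10s token timeout the cluster takes that much
longer to notice a genuinely dead node, so don't push these higher than
the backup load actually requires.
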
>> What are the reasonable values for them? I got scared by this warning:
>> "It is not recommended to alter this value without guidance from the
>> corosync community."
>> Are there any benefits to changing the rrp_mode from active to passive?

rrp_mode: passive is better tested than active. That's the only real
benefit.

> Not something I've played with, sorry.
>
>> Should it be done on both hosts?
>
> It should be the same, I would imagine.
>
>>> > ----------------------------------------------------------------
>>> >
>>> > Feb  6 04:30:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:31:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:31:41 apolo corosync[2848]: [TOTEM ] A processor failed, forming new configuration.
>>> > Feb  6 04:31:47 apolo corosync[2848]: [CLM   ] CLM CONFIGURATION CHANGE
>>> > Feb  6 04:31:47 apolo corosync[2848]: [CLM   ] New Configuration:
>>> > Feb  6 04:31:47 apolo corosync[2848]: [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
>>> > Feb  6 04:31:47 apolo corosync[2848]: [CLM   ] Members Left:
>>> > Feb  6 04:31:47 apolo corosync[2848]: [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
>>> > Feb  6 04:31:47 apolo corosync[2848]: [CLM   ] Members Joined:
>>> > Feb  6 04:31:47 apolo corosync[2848]: [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 304: memb=1, new=0, lost=1
>>
>> [snip]
>>
>>> > After lots of log lines apolo asks diana to reboot, and some time
>>> > after that it gets rebooted too.
>>> > We had an old cluster with heartbeat where DRBD used to cause this,
>>> > but now it looks like Pacemaker is the culprit.
>>> >
>>> > Here is my Pacemaker and DRBD configuration
>>> > http://www2.connection.com.br/cbastos/pacemaker/crm_config
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/global_common.setup
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/backup.res
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/export.res
>>> >
>>> > And more detailed logs
>>> > http://www2.connection.com.br/cbastos/pacemaker/reboot_apolo
>>> > http://www2.connection.com.br/cbastos/pacemaker/reboot_diana
>>
>> Best regards,
>> Carlos.

--
Dan Frincu
CCNA, RHCE

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org