Re: [Linux-HA] More monitor failures near G_SIG_dispatch delays

Dejan Muhamedagic Fri, 20 Jun 2008 07:09:14 -0700

On Fri, Jun 20, 2008 at 07:46:57AM -0400, Greg Haase wrote:
> On Fri, 2008-06-20 at 12:41 +0200, Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Thu, Jun 19, 2008 at 11:24:09AM -0400, Greg Haase wrote:
> > > Attached, please find an hb_report created for this particular setup for
> > > the timeframe when the issue occurred.
> > > 
> > > I realize that we're not supposed to sanitize these because it could
> > > obfuscate important information, but I've had to go through and sed
> > > replace a bunch of stuff for security reasons. I hope I didn't destroy
> > > anything useful to troubleshooting.
> > 
> > No problem.
> > 
> > There's nothing particularly interesting in the logs apart from
> > what you already reported. I still believe that this is a
> > performance problem. Did you notice that mysql is using a bit
> > more than 6G of memory:
> > 
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
> > 20173 mysql     15   0 6884m 6.1g 5596 S   28 78.8   3770:47 mysqld
> > 
> > According to the CPU time column it is also an extraordinarily
> > busy chap (see ps(1) and compare this time to the total time
> > since the database started, which you can find in ha-log).
> > 
> 
> This is a database server. It's a single purpose machine. For more
> information on tuning one of these things, you might look here:
> http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/
> 
> A key quote would be: "innodb_buffer_pool_size 70-80% of memory is a
> safe bet. I set it to 12G on 16GB box."


> I have this setting at 5GB on an 8GB box, so I'm actually a little below
> that recommendation.

OK. Thanks for sharing this. Must say that I forgot that
databases in general prefer to manage memory by themselves.

> 3GB of RAM should be plenty for everything else on
> the machine, and indeed the machine is NOT swapping, so I'm still
> arguing that memory is NOT a problem.

There are ~125MB in swap. Though that's not much it may be a sign
that at some point there wasn't enough memory to satisfy all
needs.
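
A rough sketch of how the swap figure could be checked from a script, in case it is useful for tracking it over time. It assumes a Linux /proc/meminfo layout with SwapTotal and SwapFree fields in kB; the function name and the sample numbers are made up for illustration and are not taken from the hb_report.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) from /proc/meminfo-style text.

    ASSUMPTION: lines look like 'SwapTotal:   2097144 kB'.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Name: value' line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


# Illustrative input only (made-up values, not from the real machine).
SAMPLE = """MemTotal:      8000000 kB
SwapTotal:     2097144 kB
SwapFree:      1969144 kB
"""

used_kb = swap_used_kb(SAMPLE)
print("swap used: %d MB" % (used_kb // 1024))
```

On a live node the same function could be fed the contents of /proc/meminfo. A non-zero value only says that pages were swapped out at some point, not that the machine is short of memory right now.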

> Regarding the CPU, we had one or two instances where a few rogue queries ran
> it up to 80% and held it there for a few minutes, but other than that we
> have a few spikes an hour that peak at 40% and our mean is much lower
> than that.

Not sure what 40% or 80% means. Sorry, I'm used to the uptime
measures, i.e. the number of runnable processes.
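
To make the two kinds of numbers comparable, here is a minimal sketch that relates the load average (what uptime prints) to the number of CPUs. The helper names are hypothetical. Note that on Linux the load average also counts tasks in uninterruptible sleep (typically waiting on disk), so it is not a pure CPU measure.

```python
import os


def load_per_cpu(load1, ncpus):
    """Normalize a 1-minute load average by the CPU count."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return load1 / ncpus


def current_load_per_cpu():
    """Read the live values from the OS (Unix only)."""
    load1, _load5, _load15 = os.getloadavg()
    ncpus = os.cpu_count() or 1  # cpu_count() may return None
    return load_per_cpu(load1, ncpus)


# Example with made-up numbers: load 4.0 on an 8-CPU box.
print(load_per_cpu(4.0, 8))
```

A value that stays well above 1.0 per CPU means tasks are queueing, which is the kind of situation where a monitor operation could plausibly miss its timeout.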

> > You have to investigate the performance and collect statistics
> > (sysstat, sar) and see how to relieve the database, which seems to
> > be both CPU and memory bound. Perhaps turn to the mysql
> > forums/support.
> 
> I will take a look, but I'm skeptical.

I really can't offer any other way forward, short of complaining
on the LKML about the vm and scheduler.

> > A few notes on your config:
> > 
> > - You don't have stonith. And you have shared storage. That's
> >   very very dangerous. Indeed.
> 
> I am using DRBD in the typical primary/secondary setup. Maybe I'm not
> completely understanding things, but with the way dopd works to prevent
> split brain, etc., I thought this kind of set up was a bit more robust
> than having both servers point to a san.

I'm not an expert on DRBD, but I don't think that dopd can
entirely replace the node-level stonith.

> Additionally, I've got DRAC5 cards in both servers, and if setup and
> configuration were all that easy I'd have them included in the cluster.
> Right now a lot of searching the lists and the web hasn't really shown
> anyone that has definitely got this to work, let alone how they got it
> to work.

It shouldn't be difficult to set up given that you have support
(i.e. a plugin) for drac5. There's one for drac3 in heartbeat.
Don't know how different they are. I can also vaguely recall
several threads regarding one or the other drac devices. My guess
is that there are people running it, but so far nobody cared to
share the plugin. Or somebody did, but we missed it. Try
searching the list archives.

> > - You have monitor ops defined for all resources, but not for the
> >   main one, i.e. the one which is actually offering a usable
> >   service. You could remove all and just monitor mysql and pingd
> >   (and for pingd it's enough to do that once every say 5
> >   minutes).
> > 
> 
> The reason that we don't monitor mysql any longer is that mysqld_safe
> monitors itself. When mysqld crashes, it restarts itself. While that
> process was going on, you had heartbeat detect that mysql was down and
> try to also restart it. With two services trying to start the same
> process, we were getting all kinds of weird issues with the pid file
> being open, etc.

Oops. Didn't know about this capability of mysql.

> That said, I agree with your point that we need to re-evaluate which
> resources we monitor and which ones we don't.

The only advice I can offer is to increase monitor timeouts
(again) and that way prevent odd failover. Reduce the number of
monitors (for example, since you run pingd there's probably no
need to watch the IPaddr resource). Try to watch those
heartbeat/lrmd warnings and see if you can correlate them to the
sysstat reports.
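
One possible way to do that correlation, as a sketch only: pull the timestamps of the G_SIG_dispatch warnings out of the log and look up the closest sysstat sample for each. The timestamp format (2008/06/17_14:20:01) and the shape of the samples list are assumptions; adjust both to whatever your ha-log and sar output actually contain. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# ASSUMPTION: log timestamps look like 2008/06/17_14:20:01
STAMP = re.compile(r"(\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})")


def warning_times(log_lines, needle="G_SIG_dispatch"):
    """Return the timestamps of log lines that mention `needle`."""
    times = []
    for line in log_lines:
        if needle not in line:
            continue
        match = STAMP.search(line)
        if match:
            times.append(
                datetime.strptime(match.group(1), "%Y/%m/%d_%H:%M:%S"))
    return times


def nearest_sample(when, samples, window=timedelta(minutes=10)):
    """Return the (timestamp, value) sample closest to `when`.

    `samples` is a list of (datetime, value) pairs, e.g. parsed from sar.
    Returns None if no sample lies within `window` of `when`.
    """
    best = None
    for ts, value in samples:
        gap = abs(ts - when)
        if gap <= window and (best is None or gap < best[0]):
            best = (gap, ts, value)
    return None if best is None else (best[1], best[2])
```

If the warnings consistently line up with samples showing high load or I/O wait, that would support the performance explanation; if they do not, it would be worth looking elsewhere.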

Thanks,

Dejan

> > - On failover, stopping mysql took close to a minute, and the
> >   timeout for the stop operation is set to two minutes. Perhaps
> >   increase this timeout. Failed stop operations are rather
> >   difficult to recover from and since you don't have stonith,
> >   such a failure would basically bring your database to a halt.
> > 
> > Good luck.
> > 
> > Dejan
> > 
> > 
> > > Also, I noticed that I almost _always_ get one of these G_SIG_dispatch
> > > delays in the logs at the time when the daily report information is
> > > output.
> > > 
> > > 
> > > 
> > > On Tue, 2008-06-17 at 14:29 -0400, Greg Haase wrote:
> > > > Last week I emailed the list regarding a node failover that occurred
> > > > when IPaddr monitor timed out. At the same time, my log was showing
> > > > G_SIG_dispatch delays in lrmd.  The thread ended in a petty discussion
> > > > over what was the proper time out value (although all the examples on
> > > > linux-ha.org show 3s here, it was suggested that I bump mine from 5s to
> > > > 15s).
> > > > 
> > > > A few minutes ago, I experienced another failover, this one due to drbd
> > > > monitor failure. None of my other logs show any kind of disk error. In
> > > > fact, my MySQL error log (located on the very drbd disk that failed)
> > > > shows the shutdown messages subsequently issued by heartbeat. Again, the
> > > > monitor failure occurs at the same time that a G_SIG_dispatch delay
> > > > occurs.
> > > > 
> > > > Now, does anyone have any idea why these errors may be occurring, and is
> > > > there a way to resolve them?
> > > > 
> > > > Please see attached log snippet.
> > > > 
> > > > 
> > > > _______________________________________________
> > > > Linux-HA mailing list
> > > > [email protected]
> > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > See also: http://linux-ha.org/ReportingProblems
> > 
> > 
> 
