Re: [Linux-HA] More monitor failures near G_SIG_dispatch delays

Greg Haase Fri, 20 Jun 2008 04:47:29 -0700

On Fri, 2008-06-20 at 12:41 +0200, Dejan Muhamedagic wrote:
> Hi,
> 
> On Thu, Jun 19, 2008 at 11:24:09AM -0400, Greg Haase wrote:
> > Attached, please find an hb_report created for this particular setup for
> > the timeframe when the issue occurred.
> > 
> > I realize that we're not supposed to sanitize these because it could
> > obfuscate important information, but I've had to go through and sed
> > replace a bunch of stuff for security reasons. I hope I didn't destroy
> > anything useful to troubleshooting.
> 
> No problem.
> 
> There's nothing particularly interesting in the logs apart from
> what you already reported. I still believe that this is a
> performance problem. Did you notice that mysql is using a bit
> more than 6G of memory:
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
> 20173 mysql     15   0 6884m 6.1g 5596 S   28 78.8   3770:47 mysqld
> 
> According to the CPU time column it is also an extraordinarily
> busy chap (see ps(1) and compare this time to the total time
> since the database started, which you can find in ha-log).
>

This is a database server. It's a single purpose machine. For more
information on tuning one of these things, you might look here:
http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/

A key quote would be: "innodb_buffer_pool_size 70-80% of memory is a
safe bet. I set it to 12G on 16GB box."

I have this setting at 5GB on an 8GB box, so I'm actually a little below
that recommendation. 3GB of RAM should be plenty for everything else on
the machine, and indeed the machine is NOT swapping, so I'm still
arguing that memory is NOT a problem.
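
To make the arithmetic explicit, here is a minimal Python sketch of that
ratio check. The function name is made up for illustration, the figures
are the ones from this thread (5GB pool on an 8GB box, and the blog's
12G on 16GB), and nothing here reads the actual server configuration.

```python
def buffer_pool_share(pool_gb, total_gb):
    """Return the InnoDB buffer pool size as a fraction of total RAM."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return pool_gb / total_gb

share = buffer_pool_share(5, 8)      # this box: 5GB pool on 8GB of RAM
quoted = buffer_pool_share(12, 16)   # the blog's example: 12G on 16GB
headroom_gb = 8 - 5                  # RAM left for everything else

print(f"this box: {share:.1%}, blog example: {quoted:.1%}, "
      f"left over: {headroom_gb}GB")
```

The configured share comes out below the 70-80% band from the quote,
which is the point I'm making above.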

Regarding the CPU, we had one or two instances where a few rogue queries ran
it up to 80% and held it there for a few minutes, but other than that we
have a few spikes an hour that peak at 40%, and our mean is much lower
than that.
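
For the comparison suggested above (accumulated CPU time against the
time since mysqld started), a rough Python sketch follows. It assumes
the top TIME+ value of 3770:47 reads as minutes:seconds, and the two
timestamps are hypothetical placeholders, not values taken from ha-log;
substitute the real start and sample times.

```python
from datetime import datetime

def parse_top_time(value):
    """Convert a top TIME+ string of the form 'minutes:seconds' to seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + float(seconds)

def average_cpu_share(cpu_seconds, started, sampled):
    """Average fraction of one CPU used between two timestamps."""
    elapsed = (sampled - started).total_seconds()
    if elapsed <= 0:
        raise ValueError("sampled must be later than started")
    return cpu_seconds / elapsed

cpu_seconds = parse_top_time("3770:47")

# Hypothetical timestamps, for illustration only.
started = datetime(2008, 6, 1, 0, 0, 0)
sampled = datetime(2008, 6, 19, 0, 0, 0)

share = average_cpu_share(cpu_seconds, started, sampled)
print(f"{cpu_seconds:.0f}s of CPU time, average {share:.1%} of one CPU")
```

This only gives a long-run average, so it says nothing about short
spikes; those would still need sar or similar sampling.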

> You have to investigate the performance and collect statistics
> (sysstat, sar) and see how to relieve the database which seems to
> be both CPU and memory bound. Perhaps to turn to mysql
> forums/support.

I will take a look, but I'm skeptical.

> 
> A few notes on your config:
> 
> - You don't have stonith. And you have shared storage. That's
>   very very dangerous. Indeed.

I am using DRBD in the typical primary/secondary setup. Maybe I'm not
completely understanding things, but with the way dopd works to prevent
split brain, etc., I thought this kind of setup was a bit more robust
than having both servers point to a SAN.

Additionally, I've got DRAC5 cards in both servers, and if setup and
configuration were all that easy I'd have included them in the cluster.
So far, a lot of searching the lists and the web hasn't turned up
anyone who has definitely got this to work, let alone how they got it
to work.
> 
> - You have monitor ops defined for all resources, but not for the
>   main one, i.e. the one which is actually offering a usable
>   service. You could remove all and just monitor mysql and pingd
>   (and for pingd it's enough to do that once every say 5
>   minutes).
> 

The reason that we don't monitor mysql any longer is that mysqld_safe
monitors itself. When mysqld crashes, mysqld_safe restarts it. While that
restart was in progress, heartbeat would detect that mysql was down and
also try to restart it. With two services trying to start the same
process, we were getting all kinds of weird issues with the pid file
being open, etc.

That said, I agree with your point that we need to re-evaluate which
resources we monitor and which ones we don't.

> - On failover, stopping mysql took close to a minute, and the
>   timeout for the stop operation is set to two minutes. Perhaps
>   increase this timeout. Failed stop operations are rather
>   difficult to recover from and since you don't have stonith,
>   such a failure would basically bring your database to a halt.
> 
> Good luck.
> 
> Dejan
> 
> 
> > Also, I noticed that I almost _always_ get one of these G_SIG_dispatch
> > delays in the logs at the time when the daily report information is
> > output.
> > 
> > 
> > 
> > On Tue, 2008-06-17 at 14:29 -0400, Greg Haase wrote:
> > > Last week I emailed the list regarding a node failover that occurred
> > > when IPAddr monitor timed out. At the same time, my log was showing
> > > G_SIG_dispatch delays in lrmd.  The thread ended in a petty discussion
> > > over what was the proper time out value (although all the examples on
> > > linux-ha.org show 3s here, it was suggested that I bump mine from 5s to
> > > 15s).
> > > 
> > > A few minutes ago, I experienced another failover, this one due to drbd
> > > monitor failure. None of my other logs show any kind of disk error. In
> > > fact, my MySQL error log (located on the very drbd disk that failed)
> > > shows the shutdown messages subsequently issued by heartbeat. Again, the
> > > > monitor failure occurs at the same time that a G_SIG_dispatch delay
> > > > occurs.
> > > 
> > > > Does anyone have any idea why these errors may be occurring, and is
> > > > there a way to resolve them?
> > > 
> > > Please see attached log snippet.
> > > 
> > > 
> > > _______________________________________________
> > > Linux-HA mailing list
> > > [email protected]
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
> 
> 

