Hi All,

It's been a month since my last e-mail and I wanted to give an update on where things stand. Unfortunately Bacula is still crashing on a weekly basis, although I'm beginning to suspect this is due to my Postgres configuration and not Bacula itself. Please read on for details.

- As a refresher, I was running Bacula 5.0.3 until an accidental system update upgraded Bacula to 5.2.2, which is when I started seeing drastically different performance stats for Postgres and Bacula started crashing about once a week.
- On 1/18 I upgraded Bacula from 5.2.2 to 5.2.3 through FreeBSD ports. Unfortunately this did not appear to make any difference, and after upgrading on a Wednesday Bacula still crashed again the following Saturday.

When Bacula crashes, the following things are observed (stats obtained through munin monitoring):
- Normally the graph of Active and Idle postgres connections peaks at the beginning of each backup schedule and then decreases as backups complete and are written to the database. When things start going wrong, instead of decreasing, the number of Active and Idle postgres connections increases at the beginning of each hour when a new backup schedule kicks off. At a certain point (just under 200 total postgres connections), I start seeing a growing number of connections in a "Waiting for lock" state. Once I see this happening, it seems to be the point of no return for Bacula, and the bacula-dir process eventually crashes after running out of resources.
- At the same time as the number of "Waiting for lock" connections starts increasing, the postgresql locks graph shows a growing number of "AccessShareLock", "RowExclusiveLock" and then "ShareRowExclusiveLock" locks, the last of which seems to be the killer. When I do a "status dir" from bconsole I see a large number of systems that appear stuck in the "Dir inserting Attributes" state.
- The file table usage graph shows that at the time I observe performance/lock problems in postgres, the number of open files reported through munin jumps from 6K to 20K and continues to grow to 36K, where it is hovering now (I expect Bacula to crash soon).
- The amount of physical memory in use jumps from about 1G to 3G.
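
The lock symptoms described above can also be checked directly in the database rather than only through the munin graphs. The following is a minimal Python sketch, assuming the rows come from the standard `pg_locks` view (columns `mode` and `granted`). The helper names, the sample rows, and the threshold of 10 waiting locks are hypothetical and only for illustration; fetching the rows with a real database driver is left out.

```python
from collections import Counter

# Query a monitoring script might run. pg_locks has one row per lock,
# with the lock mode and whether the lock has been granted yet.
LOCK_QUERY = "SELECT mode, granted FROM pg_locks"


def summarize_locks(rows):
    """Count held and waiting locks per lock mode.

    rows is an iterable of (mode, granted) tuples as returned by LOCK_QUERY.
    """
    held = Counter()
    waiting = Counter()
    for mode, granted in rows:
        if granted:
            held[mode] += 1
        else:
            waiting[mode] += 1
    return held, waiting


def looks_stuck(waiting, threshold=10):
    """Return True when the number of ungranted locks reaches the threshold.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return sum(waiting.values()) >= threshold


if __name__ == "__main__":
    # Hypothetical sample rows, not real output from the affected server.
    sample = [
        ("AccessShareLock", True),
        ("RowExclusiveLock", True),
        ("ShareRowExclusiveLock", True),
        ("ShareRowExclusiveLock", False),
    ]
    held, waiting = summarize_locks(sample)
    print("held:", dict(held))
    print("waiting:", dict(waiting))
    print("stuck:", looks_stuck(waiting))
```

Logging these counts every few minutes would show whether the ungranted "ShareRowExclusiveLock" requests pile up before or after the connection count starts climbing.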

*My Bacula/Postgres Environment:*
- Backup 1700 clients per day (~2TB/day)
- Multiple backup schedules starting on the hour every hour except for 2pm-6pm to allow for jobs to complete before the next backup window.
- Director concurrency set to 100 (used to be 250 but has been cranked down due to problems)
- System specs: 24GB memory, dual 6-core Intel Xeon E5649 2.53GHz processors, 1TB disk for database
- Postgres config: max_connections = 1000, shared_buffers = 6144MB, work_mem = 16MB, maintenance_work_mem = 3072MB, max_files_per_process = 2000, wal_buffers = 8MB, checkpoint_segments = 128, checkpoint_timeout = 5min
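
As a rough sanity check on these settings, the worst-case resource ceilings they imply can be worked out with a little arithmetic. The Python sketch below uses only the values listed above. It assumes that each connection may use up to work_mem per sort or hash operation and may open up to max_files_per_process files, so the results are upper bounds and not measurements. The function names and the `ops_per_connection` parameter are hypothetical.

```python
# Values copied from the environment list above.
RAM_MB = 24 * 1024
SHARED_BUFFERS_MB = 6144
WORK_MEM_MB = 16
MAX_CONNECTIONS = 1000
MAX_FILES_PER_PROCESS = 2000


def worst_case_memory_mb(connections, ops_per_connection=1):
    """Upper bound on memory: shared_buffers plus work_mem for every
    concurrent sort/hash operation on every connection."""
    return SHARED_BUFFERS_MB + connections * ops_per_connection * WORK_MEM_MB


def worst_case_open_files(connections):
    """Upper bound on open files if every backend reaches its limit."""
    return connections * MAX_FILES_PER_PROCESS


if __name__ == "__main__":
    for n in (100, 200, MAX_CONNECTIONS):
        mem = worst_case_memory_mb(n)
        files = worst_case_open_files(n)
        fits = "fits in" if mem <= RAM_MB else "exceeds"
        print(f"{n} connections: up to {mem} MB ({fits} {RAM_MB} MB RAM), "
              f"up to {files} open files")
```

With one work_mem-sized operation per connection, even 1000 connections stay just under the 24GB of RAM, but two such operations per connection would exceed it. The open-file ceiling at around 200 connections is far above the 36K seen in the graphs, so the observed growth is at least allowed by these limits.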

Based on this information does anyone have any ideas on what I might try to resolve or troubleshoot this problem further? I can provide additional detail upon request.

Thanks all,
  Jenny  =)

On 01/05/2012 03:36 PM, Jenny Aquilino wrote:
Hi Martin,

Thank you for your response and suggestion on how to get visibility into
what is going on when the director appears to hang.  =)

I recently lowered the "Maximum Concurrent Jobs" directive in the
Director stanza and have avoided doing a reload or query of the director
in order to reduce the amount of "stuff" going on that could be
exacerbating the problem.  So far Bacula hasn't crashed; however, it did
run for 3 days immediately after doing the upgrade to 5.2.2 so I have a
sneaking suspicion Bacula may still crash when more fulls kick off this
weekend.  I will definitely use the gdb trick you mentioned to see what
is going on if/when that happens.  I also fixed the btraceback script
(it wasn't getting the correct location of the bacula-dir executable
passed to it) so when/if it does crash, I'll have something more to
provide you guys for clues.

In parallel I'm also working to compile Bacula 5.2.3 for FreeBSD using
the compile options from Dan Langille's port for 5.2.2 and if that goes
well will do the upgrade to see if that has any effect on the current
issues I have been seeing.

Thanks again and I'll keep you guys posted.

-Jenny  =)

On 01/05/2012 04:22 AM, Martin Simmons wrote:
On Tue, 03 Jan 2012 16:43:24 -0800, Jenny Aquilino said:
I have been happily running bacula-server-5.0.3 with postgresql-9.0.5_1
on a FreeBSD 8 server until an ill-fated chain of events led me to
have to upgrade bacula-server to 5.2.2 before I had a chance to test it
in a development environment.  Although I followed the release notes by
upgrading the storage nodes and running 'update_bacula_tables,' since
the upgrade the server (director) has crashed twice in 5 days and I've
had to manually restart it a couple of times after normal queries like
"status storage=X" appear to hang.

Based on analysis of Munin graphs that report things like PostgreSQL
connections, PostgreSQL locks, process states, network connections
(netstat), and memory utilization it appears that something significant
has changed between 5.0.3 and 5.2.2 that is leaving a very high number
of PostgreSQL connections in Idle state instead of being closed.  When
Bacula crashes the PostgreSQL connections graph shows a large number of
connections in "Waiting for lock" state.  At the same time looking at
the PostgreSQL locks graph shows a very large number of
"ShareRowExclusive" and "AccessShare" locks which is behaviour we didn't
see prior to upgrading to 5.2.2.  If anyone would like a copy of these
graphs I can send them to you directly or post them to the mailing list
if that is allowed.

I know that 5.2.3 was released on 12/16 and saw that there was a bug fix
with update stats that I thought may be related to what I'm seeing;
however, I have not updated because 5.2.3 has yet to make it into the
FreeBSD ports collection.  Based on the problem I described does this
sound like something that may have been fixed in 5.2.3?

I can't answer that because the bug in question (3419) is not in the Mantis
bug reporting system.

If not, does anyone have other ideas on what I can do to troubleshoot?

You could attach gdb to the director process when it hangs and see what is
happening with

thread apply all bt

__Martin

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel

--
=======================================================
Jennifer R. Aquilino
S&T IT Support
Lawrence Livermore National Laboratory
Mail Stop L-556
7000 East Avenue
Livermore, CA 94550

Voice: (925)-424-4585
Fax:   (925)-423-8719
Email: aquili...@llnl.gov
========================================================
