>>>>> On Wed, 15 Dec 2010 10:15:09 +0100, Robert Wirth said:

> > On Tue, December 14, 2010 11:48 am, Robert Wirth wrote:
> > > Hi,
> > >
> > > Strange problem. Here's some hardware where Bacula has been running
> > > successfully for about 5 years: release 1.38.11 under Solaris 10 x86.
> > >
> > > Last month, we had a system disk crash on the backup server. No
> > > backup data was lost; we just had to reinstall the backup system.
> > > Since this was our only Solaris x86 system, we decided to migrate
> > > to Linux and to a newer Bacula release. Until the repaired hardware
> > > was available, we started with a virtualized new system, just for
> > > the daily incremental backups to disk volumes.
> > >
> > > Since most of our current systems are Ubuntu Hardy LTS servers, we
> > > chose this distribution's Bacula 2.2.8 as our new version (old, yes,
> > > but 1.38.11 was running well, and 2.2.8 was the default).
> > >
> > > We upgraded Bacula's MySQL database from 1.38.11 to 2.2.8 with the
> > > corresponding script, and imported the updated DB via mysqldump into
> > > the new system, which runs MySQL 5.1.41 on Linux kernel 2.6.32.
> > > The virtualized system worked well all the time.
> > >
> > > Now the hardware version of the system is ready, and a yearly full
> > > backup, which goes directly to tape, is imminent.
> > >
> > > And now the strange things begin...
> > >
> > > /* The system is a 2x2-core AMD Opteron box, 4 GB RAM, 6x LSI SCSI
> > >    U320 MegaRAID with separate channels for external disks, tape
> > >    drives and the autochanger. 23 TB disk storage on external RAIDs,
> > >    autochanger and HP drives for LTO-3 tapes. OS: see above. */
> > >
> > > NOW BACKING UP...
> > >
> > > Starting a bunch of full backup jobs which fit onto 1 SINGLE TAPE
> > > produces NO PROBLEMS: the jobs start, run, write, and terminate
> > > within the usual span of time.
> > > This way, I can back up a dozen systems totalling 360 GB onto one
> > > tape in a few hours.
> > >
> > > FACING THE PROBLEM...
> > >
> > > Starting a bunch of full backup jobs that DO NOT FIT onto 1 single
> > > tape proceeds as follows (with a fresh tape forced by setting the
> > > former one to read-only):
> > >
> > > - First, the jobs run well and write their data to the first fresh
> > >   tape of the corresponding pool. Speed is similar to what we knew
> > >   from the old OS.
> > >
> > > - When the tape is full, with around 600 GB of data, it is marked
> > >   Full and unloaded, and the next free tape of the pool is loaded.
> > >
> > > - From this moment on, writing to the new fresh tape becomes
> > >   incredibly slow (4 GB/hour), and mysqld constantly has 95%-100%
> > >   CPU load. No other process has a significant load, and the mysql
> > >   load isn't reflected in the system's load values:
> > >
> > > Cpu(s): 3.3%us, 2.2%sy, 0.0%ni, 91.6%id, 2.1%wa, 0.1%hi, 0.7%si, 0.0%st
> > > Mem:  3961616k total, 3850072k used, 111544k free, 17532k buffers
> > > Swap: 3906552k total, 0k used, 3906552k free, 3579956k cached
> > >
> > >   PID USER  PR NI VIRT RES  SHR S %CPU %MEM   TIME+   COMMAND
> > >  1356 mysql 20  0 144m 31m 2376 S   98  0.8 163:57.79 mysqld
> > >     1 root  20  0 2620 948  528 S    0  0.0   0:00.63 init
> > >     2 root  20  0    0   0    0 S    0  0.0   0:00.00 kthreadd
> > > ....
> > >
> > > The only further effect I can see is that the table "bacula.JobMedia"
> > > keeps growing. No errors in the system log, no MySQL errors, and
> > > none in Bacula's log.
> > >
> > > What I mainly don't understand is why this happens after a tape
> > > change. The MaxSpoolSize is 32 GB, and I'm backing up 7 systems.
> > > Each of them went through several spool cycles during the first tape.
> > >
> > > From the view of Bacula and its program logic, what has changed when
> > > the tape has been changed?
> > > I guess it's all the same: spooling data, writing it to tape and
> > > updating the catalog, regardless of first, second or later tape...?!?

> > What do you see under Running Jobs in the 'status dir' output before
> > and after the first tape has filled?
> >
> > If you have only the 'after' just now, that might be interesting.

> Will try to see this on the next try.
>
> /* Current state: the second tape has been loaded and in use since
>    yesterday 02:14 p.m., and really only 29 GB have been written :-((( */
>
> And I wrote something wrong earlier: there IS something strange in the
> Bacula log. Oooh.
>
> Yesterday, I started the 7 not-fitting big backup jobs around 9:45 a.m.
> The first tape was filled with 540 GB by 02:45 p.m.:
>
> | 833 | 000090L1 | Full | 1 | 543,602,949,120 | 544 | 63,072,000 | 0 | 1 | 1 | LTO-3 | 2010-12-14 14:45:52 |
>
> During this time, I could follow the spooling via messages in bconsole.
> It all looked correct: parallel spooling and writing of data.
>
> /* PARENTHESIS:
>    Current state: the second tape was loaded yesterday 02:45 p.m., and
>    only 29 GB have been written in the 18 hours since then :-(((
>    | 834 | 000098L1 | Append | 1 | 29,998,080,000 | 30 | 63,072,000 | 0 | 4 | 1 | LTO-3 | 2010-12-15 08:14:46 |
>    END OF PARENTHESIS */
>
> In the log, there's only information about the job which I started
> first (lnv-102). No word about the other 6 jobs and their spooling!
> And the logfile stops at 01:22 p.m., more than one hour before the
> tape was full and changed:
>
> 14-Dec 09:44 bup-serv-dir JobId 93518: Start Backup JobId 93518,
>   Job=lnv-102.2010-12-14_09.44.00
> 14-Dec 09:46 bup-serv-dir JobId 93518: Using Device "Tape1Left"
> 14-Dec 09:47 bup-serv-sd JobId 93518: Spooling data ...
> 14-Dec 10:21 bup-serv-sd JobId 93518: User specified spool size reached.
> 14-Dec 10:21 bup-serv-sd JobId 93518: Writing spooled data to Volume.
>   Despooling 32,000,031,721 bytes ...
> 14-Dec 11:24 bup-serv-sd JobId 93518: Despooling elapsed time = 00:17:28,
>   Transfer rate = 30.53 M bytes/second
> 14-Dec 11:24 bup-serv-sd JobId 93518: Spooling data again ...
> 14-Dec 11:43 bup-serv-sd JobId 93518: User specified spool size reached.
> 14-Dec 11:43 bup-serv-sd JobId 93518: Writing spooled data to Volume.
>   Despooling 32,000,032,040 bytes ...
> 14-Dec 12:58 bup-serv-sd JobId 93518: Despooling elapsed time = 00:14:45,
>   Transfer rate = 36.15 M bytes/second
> 14-Dec 12:58 bup-serv-sd JobId 93518: Spooling data again ...
> 14-Dec 13:22 bup-serv-sd JobId 93518: User specified spool size
> [END OF FILE]
>
> During this time, 17 spool sessions of all 7 systems, 32 GB each, were
> written to tape, not just the 2 spools of one system that are logged here.
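Just to put numbers on it: the despool rate in your quoted log is consistent with its byte count and elapsed time, while your own media listing puts the post-change rate well below 1 MB/s. A quick back-of-the-envelope check, using only figures from the quoted output:

```shell
# First despool run in the log: 32,000,031,721 bytes in 00:17:28
awk 'BEGIN { printf "%.2f MB/s\n", 32000031721 / (17*60 + 28) / 1e6 }'

# Rate since the tape change: 29,998,080,000 bytes in ~18 hours
awk 'BEGIN { printf "%.2f MB/s\n", 29998080000 / (18*3600) / 1e6 }'
```

The first prints 30.53 MB/s, matching the log line exactly; the second comes out around 0.46 MB/s, i.e. under 2 GB/hour -- roughly two orders of magnitude slower once the second tape is loaded.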
It looks like the log stream isn't being flushed to the file. Can you
try the bconsole messages command instead?

> "status dir" tells me that this one job has terminated, while the
> other 6 jobs are still marked as running:
>
>  93518 Full  lnv-102.2010-12-14_09.44.00 has terminated
>
> But nevertheless, the last spool file (25 GB) of this job is still
> present in the spool directory:
>
> -rw-r----- 1 root bacula 25590119686 2010-12-14 14:48
>   bup-serv-sd.data.93518.lnv-102.2010-12-14_09.44.00.Tape1Left.spool
>
> And the spool metadata file has just 100 MB:
>
> -rw-r----- 1 root bacula 103063552 2010-12-14 14:48
>   bup-serv-sd.attr.lnv-102.2010-12-14_09.44.00.5.spool
>
> bacula-sd holds these two files open, as lsof shows:
>
> bacula-sd 2150 root 13u REG 251,5 25590119686     13
>   /var/services/bacula/spool1/bup-serv-sd.data.93518.lnv-102.2010-12-14_09.44.00.Tape1Left.spool
> bacula-sd 2150 root 14u REG 251,1   103063552 131160
>   /var/services/bacula/work/bup-serv-sd.attr.lnv-102.2010-12-14_09.44.00.5.spool
>
> but there is no longer a connection to the FD of that system. This is
> OK, since all its data has been spooled. For the 6 other jobs,
> connections AND spool files are present. Correct.
>
> Spooling of lnv-102 has terminated, but its last spool isn't yet
> written to tape. So there should be NO HEAVY mysql operations yet,
> since no job has been completely despooled. Right?
>
> Where does the mysql load come from? Remember that the bunch of
> tape-fitting jobs wrote 360 GB in a few hours, including DB operations.

Can you turn on mysql query logging?

__Martin
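P.S. In case it helps: on MySQL 5.1 the general query log can be enabled at runtime, no server restart needed. A sketch (the log path is just an example, and the SHOW INDEX checks are only a suggestion of mine, not something Bacula requires you to run):

```sql
-- Run as a privileged MySQL user; pick any writable log path.
SET GLOBAL general_log_file = '/tmp/mysql-query.log';
SET GLOBAL general_log = 'ON';

-- While you're in there, it may be worth confirming that the
-- 1.38 -> 2.2 catalog upgrade script left the indexes on the hot
-- tables in place; unindexed JobMedia/attribute inserts would fit
-- the symptoms of a pegged mysqld and a growing JobMedia table.
SHOW INDEX FROM JobMedia;
SHOW INDEX FROM File;
```

Turn the general log off again afterwards (SET GLOBAL general_log = 'OFF';), since it logs every statement and will itself cost some throughput.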
_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users
