Re: [Bacula-users] Timeout (?) problems with some Full backups

2009-08-13 Thread Nick Lock
Thanks very much for the help everyone, I do appreciate it. And quite
possibly - you've fixed it :)

From the top:

John Drescher wrote:

 I would make the heartbeat interval much shorter.

Thinking that this might be a good idea, I dropped all the Heartbeat
Intervals to 1 minute and left the same backup to run again along with
the usual overnight backups. (hooray for bash, I can push a new
bacula-fd.conf to all the servers in less than 2 minutes!) As it was the
end of the day for me, I thought letting Bacula do its stuff in peace
and taking a fresh look this morning would be best.
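
For anyone following along, the change amounts to this one directive in
each bacula-fd.conf (a sketch - the resource name is illustrative; the
same directive also exists in the SD and Dir configs):

```conf
# bacula-fd.conf - sketch; only the Heartbeat Interval line is the change
FileDaemon {
  Name = scavenger-fd            # illustrative name
  Heartbeat Interval = 1 minute  # keeps otherwise-idle Dir<->FD sockets alive
}
```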

I assume that a 1-minute heartbeat won't be too detrimental to the
network, compared to the flood of backup data that's coming into the
same machine? (I run the Dir and SD on the same server.)

I'll show the results below...

 Also I am interested in why the backup rate is this slow. Slow
 network?

This particular machine is a VM on a quite highly loaded host. We put
lower-priority, less demanding VMs together on one host in order
to give more resources to higher-priority tasks. The effective CPU speed
in the VM is quite low, so I think that the compression which the
bacula-fd task performs on the data is a rate-limiting step. We *can*
get line-speed transfers from other servers, it just happens that this
one is a little slower!
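
If client CPU really is the rate-limiting step, one knob worth trying is
a lower gzip level in the FileSet Options (a sketch, not our actual
config - names are taken from the job report; GZIP1 trades compression
ratio for much less CPU on the client):

```conf
# FileSet sketch - illustrative; GZIP1..GZIP9 select the gzip level
FileSet {
  Name = "full-scavenger"
  Include {
    Options {
      signature = MD5
      compression = GZIP1   # level 1 = fastest, least CPU on the client
    }
    File = /
  }
}
```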



Josh Fisher wrote:

 Something is dropping the Dir-FD connection. A router in between FD
 and Dir, perhaps.

Yes, it seems that something is dropping the connection. For peace of
mind I'll have a word with our networking engineer as to what the
current connection timeout values are on the hardware I don't directly
control. It's also occurred to me that I'll have to check what the
current tcp timeout settings are on the server - maybe a recent update
changed them and I didn't notice...? (Full backups have been running OK
for a few months prior to my current problems!)
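
For the record, the kernel-side settings I mean can be read straight out
of /proc on Linux (values are seconds, seconds, and a probe count):

```shell
# Print the current TCP keepalive tunables (Linux)
for f in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
    printf '%-22s = %s\n' "$f" "$(cat /proc/sys/net/ipv4/$f)"
done
```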

However, it might be moot after last night's attempts with a 1 minute
heartbeat (see below).


John Lockard wrote:

 While the job is running, keep an eye on the system which houses
 your MySQL database and make sure that it isn't filling up a
 partition with temp data.  I was running into a similar problem
 and needed to move my mysql_tmpdir (definable in /etc/my.cnf)
 to another location.

I'd never even considered that! We're using Postgres for the db; its
temp directory is
currently /var/lib/postgresql/8.3/main/base/pgsql_tmp and has 33 GB free.
I don't think that it's the problem this time, but I'll certainly have
to keep it in mind in the future...
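
A quick way to keep an eye on that in future (path as above; the
fallback to / is just so the snippet runs on any box):

```shell
# Report free space under the Postgres temp directory
PG_TMP="/var/lib/postgresql/8.3/main/base/pgsql_tmp"
[ -d "$PG_TMP" ] || PG_TMP="/"   # fall back if the path doesn't exist here
df -Pk "$PG_TMP"
```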



OK - the results. Here's the same backup that I showed failing in my
first post:
---

12-Aug 16:37 exa-bacula-dir JobId 5518: Start Backup JobId 5518,
Job=backup_scavenger.2009-08-12_16.37.28.03
12-Aug 16:37 exa-bacula-dir JobId 5518: Purging oldest volume
scavenger-full-1191
12-Aug 16:37 exa-bacula-dir JobId 5518: 1 File on Volume
scavenger-full-1191 purged from catalog.
12-Aug 16:37 exa-bacula-dir JobId 5518: There are no more Jobs
associated with Volume scavenger-full-1191. Marking it purged.
12-Aug 16:37 exa-bacula-dir JobId 5518: All records pruned from Volume
scavenger-full-1191; marking it Purged
12-Aug 16:37 exa-bacula-dir JobId 5518: Using Device
FileStorageScavenger
12-Aug 16:37 exa-bacula-sd JobId 5518: Recycled volume
scavenger-full-1191 on device
FileStorageScavenger (/srv/bacula/volume/web-scavenger), all previous
data lost.
12-Aug 16:37 exa-bacula-dir JobId 5518: Max Volume jobs exceeded.
Marking Volume scavenger-full-1191 as Used.
12-Aug 18:16 exa-bacula-sd JobId 5518: Job write elapsed time =
01:38:34, Transfer rate = 373.4 K bytes/second
12-Aug 18:19 exa-bacula-dir JobId 5518: Bacula exa-bacula-dir 2.4.4
(28Dec08): 12-Aug-2009 18:19:42
  Build OS:   x86_64-pc-linux-gnu debian lenny/sid
  JobId:  5518
  Job:backup_scavenger.2009-08-12_16.37.28.03
  Backup Level:   Full
  Client: scavenger 2.4.4 (28Dec08)
i486-pc-linux-gnu,debian,5.0
  FileSet:full-scavenger 2009-04-16 15:58:05
  Pool:   scavenger-full (From Job FullPool override)
  Storage:FileScavenger (From Job resource)
  Scheduled time: 12-Aug-2009 16:37:24
  Start time: 12-Aug-2009 16:37:30
  End time:   12-Aug-2009 18:19:42
  Elapsed time:   1 hour 42 mins 12 secs
  Priority:   10
  FD Files Written:   81,883
  SD Files Written:   81,883
  FD Bytes Written:   2,197,458,673 (2.197 GB)
  SD Bytes Written:   2,208,651,895 (2.208 GB)
  Rate:   358.4 KB/s
  Software Compression:   28.4 %
  VSS:no
  Storage Encryption: no
  Volume name(s): scavenger-full-1191
  Volume Session Id:  1
  Volume Session Time:1250091413
  Last Volume Bytes:  2,212,931,408 (2.212 GB)
  Non-fatal FD errors:0
  

[Bacula-users] Timeout (?) problems with some Full backups

2009-08-12 Thread Nick Lock
Hello list!

Sorry to trouble you with what's probably a simple problem, but I'm now
looking at the very real possibility of wiping all our backups clean and
starting from scratch if I can't fix it... :(

I'm having problems with some Full backups, which run for between 1 and
2 hours, appearing to time out after the data transfer from the FD to
the SD. The error message (shown below) shows that the data transfer
completes, often in about 1hr30min, and then Bacula does nothing until
the job has been running for 2 hours at which point it gives an FD
error.

Other Full backups (which don't take as long) run correctly, and for
most of the time Inc and Diff backups also run correctly. However, a
small percentage of backups will fail at random, also with FD errors but
at random elapsed times during the job... this I have been ascribing to
network fluctuations! The difference is that re-running these random
failures will succeed, whilst this particular Full failure doesn't! ;)

I've already tried setting a heartbeat interval of 20 minutes in the
FD/SD and DIR conf files (thinking that the FD - Dir connection was
timing out) but this doesn't change anything.

In the time between the data transfer finishing and the timeout,
Postgres has an open connection with a COPY batch FROM STDIN
transaction in progress, which at the timeout produces errors in the
Postgres log that I have also shown below.

I'm happy to post portions of the conf files if needed, but they're huge
and might well lead to tl;dr!

Any suggestions as to how I can troubleshoot this further would be most
appreciated!

Nick Lock.


-
12-Aug 14:18 exa-bacula-dir JobId 5514: Start Backup JobId 5514,
Job=backup_scavenger.2009-08-12_14.18.06.04
12-Aug 14:18 exa-bacula-dir JobId 5514: There are no more Jobs
associated with Volume scavenger-full-1250. Marking it purged.
12-Aug 14:18 exa-bacula-dir JobId 5514: All records pruned from Volume
scavenger-full-1250; marking it Purged
12-Aug 14:18 exa-bacula-dir JobId 5514: Recycled volume
scavenger-full-1250
12-Aug 14:18 exa-bacula-dir JobId 5514: Using Device
FileStorageScavenger
12-Aug 14:18 exa-bacula-sd JobId 5514: Recycled volume
scavenger-full-1250 on device
FileStorageScavenger (/srv/bacula/volume/web-scavenger), all previous
data lost.
12-Aug 14:18 exa-bacula-dir JobId 5514: Max Volume jobs exceeded.
Marking Volume scavenger-full-1250 as Used.
12-Aug 15:49 exa-bacula-sd JobId 5514: Job write elapsed time =
01:31:41, Transfer rate = 401.4 K bytes/second
12-Aug 16:18 exa-bacula-dir JobId 5514: Fatal error: Network error with
FD during Backup: ERR=Connection reset by peer
12-Aug 16:18 exa-bacula-dir JobId 5514: Fatal error: No Job status
returned from FD.
12-Aug 16:18 exa-bacula-dir JobId 5514: Error: Bacula exa-bacula-dir
2.4.4 (28Dec08): 12-Aug-2009 16:18:09
  Build OS:   x86_64-pc-linux-gnu debian lenny/sid
  JobId:  5514
  Job:backup_scavenger.2009-08-12_14.18.06.04
  Backup Level:   Full
  Client: scavenger 2.4.4 (28Dec08)
i486-pc-linux-gnu,debian,5.0
  FileSet:full-scavenger 2009-04-16 15:58:05
  Pool:   scavenger-full (From Job FullPool override)
  Storage:FileScavenger (From Job resource)
  Scheduled time: 12-Aug-2009 14:18:03
  Start time: 12-Aug-2009 14:18:09
  End time:   12-Aug-2009 16:18:09
  Elapsed time:   2 hours 
  Priority:   10
  FD Files Written:   0
  SD Files Written:   81,883
  FD Bytes Written:   0 (0 B)
  SD Bytes Written:   2,208,578,175 (2.208 GB)
  Rate:   0.0 KB/s
  Software Compression:   None
  VSS:no
  Storage Encryption: no
  Volume name(s): scavenger-full-1250
  Volume Session Id:  5
  Volume Session Time:1250080970
  Last Volume Bytes:  2,212,857,316 (2.212 GB)
  Non-fatal FD errors:0
  SD Errors:  0
  FD termination status:  Error
  SD termination status:  OK
  Termination:*** Backup Error ***

-
Postgres Log:

2009-08-12 16:18:09 BST ERROR:  unexpected message type 0x58 during COPY
from stdin
2009-08-12 16:18:09 BST CONTEXT:  COPY batch, line 81884: 
2009-08-12 16:18:09 BST STATEMENT:  COPY batch FROM STDIN
2009-08-12 16:18:09 BST LOG:  could not send data to client: Broken pipe
2009-08-12 16:18:09 BST LOG:  could not receive data from client:
Connection reset by peer
2009-08-12 16:18:09 BST LOG:  unexpected EOF on client connection



--
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day 
trial. Simplify your report design, integration and deployment - and focus on 
what you do best, core application coding. Discover what's new with 
Crystal Reports now.  

Re: [Bacula-users] Timeout (?) problems with some Full backups

2009-08-12 Thread John Drescher
 Sorry to trouble you with what's probably a simple problem, but I'm now
 looking at the very real possibility of wiping all our backups clean and
 starting from scratch if I can't fix it... :(

It's highly doubtful that wiping everything and starting over will help.

 I'm having problems with some Full backups, which run for between 1 and
 2 hours, appearing to time out after the data transfer from the FD to
 the SD. The error message (shown below) shows that the data transfer
 completes, often in about 1hr30min, and then Bacula does nothing until
 the job has been running for 2 hours at which point it gives an FD
 error.

 Other Full backups (which don't take as long) run correctly, and for
 most of the time Inc and Diff backups also run correctly. However, a
 small percentage of backups will fail at random, also with FD errors but
 at random elapsed times during the job... this I have been ascribing to
 network fluctuations! The difference is that re-running these random
 failures will succeed, whilst this particular Full failure doesn't! ;)

 I've already tried setting a heartbeat interval of 20 minutes in the
 FD/SD and DIR conf files (thinking that the FD - Dir connection was
 timing out) but this doesn't change anything.

 In the time between the data transfer finishing and the timeout,
 Postgres has an open connection with a COPY batch FROM STDIN
 transaction in progress, which at the timeout produces errors in the
 Postgres log that I have also shown below.

 I'm happy to post portions of the conf files if needed, but they're huge
 and might well lead to tl;dr!

 Any suggestions as to how I can troubleshoot this further would be most
 appreciated!


I would make the heartbeat interval much shorter.

Also I am interested in why the backup rate is this slow. Slow network?

John

___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Timeout (?) problems with some Full backups

2009-08-12 Thread John Lockard
While the job is running, keep an eye on the system which houses
your MySQL database and make sure that it isn't filling up a
partition with temp data.  I was running into a similar problem
and needed to move my mysql_tmpdir (definable in /etc/my.cnf)
to another location.
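
The change John describes would look something like this in /etc/my.cnf
(the directory is illustrative - pick a partition with plenty of
headroom; the server variable itself is spelled tmpdir):

```conf
# /etc/my.cnf - sketch
[mysqld]
tmpdir = /srv/mysql-tmp   # illustrative location on a roomy partition
```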

-John

On Wed, Aug 12, 2009 at 05:00:30PM +0100, Nick Lock wrote:
 Hello list!
 
 Sorry to trouble you with what's probably a simple problem, but I'm now
 looking at the very real possibility of wiping all our backups clean and
 starting from scratch if I can't fix it... :(
 
 I'm having problems with some Full backups, which run for between 1 and
 2 hours, appearing to time out after the data transfer from the FD to
 the SD. The error message (shown below) shows that the data transfer
 completes, often in about 1hr30min, and then Bacula does nothing until
 the job has been running for 2 hours at which point it gives an FD
 error.
 
 Other Full backups (which don't take as long) run correctly, and for
 most of the time Inc and Diff backups also run correctly. However, a
 small percentage of backups will fail at random, also with FD errors but
 at random elapsed times during the job... this I have been ascribing to
 network fluctuations! The difference is that re-running these random
 failures will succeed, whilst this particular Full failure doesn't! ;)
 
 I've already tried setting a heartbeat interval of 20 minutes in the
 FD/SD and DIR conf files (thinking that the FD - Dir connection was
 timing out) but this doesn't change anything.
 
 In the time between the data transfer finishing and the timeout,
 Postgres has an open connection with a COPY batch FROM STDIN
 transaction in progress, which at the timeout produces errors in the
 Postgres log that I have also shown below.
 
 I'm happy to post portions of the conf files if needed, but they're huge
 and might well lead to tl;dr!
 
 Any suggestions as to how I can troubleshoot this further would be most
 appreciated!
 
 Nick Lock.
 
 
 [quoted job log and Postgres log snipped - identical to the logs in the
 first message above]