Hi again,

I forced the errors we've been chasing to re-occur with my patched-up
bacula-3.0.3 install, by reducing PostgreSQL's maximum connections to
4 and running 12 backup jobs simultaneously-ish (started with a bash
for loop piped into bconsole). I can confirm that the PostgreSQL error
is being logged correctly now, but I'm not 100% sure it's being
handled correctly.
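
For anyone wanting to reproduce this, here is a minimal sketch of that kind of loop. It is not the exact command I ran. It assumes max_connections = 4 has been set in postgresql.conf and PostgreSQL restarted, and that each name in JOBS matches a Job resource in bacula-dir.conf. Only four of the twelve job names appear in the logs below, so the list here is illustrative, and gen_run_commands is a helper name made up for this example.

```shell
#!/bin/bash
# Illustrative job list: these four names appear in the logs in this mail;
# the remaining jobs are not named there, so add your own.
JOBS="graham-desktop richard-desktop norman-desktop rmarst-desktop"

# Print one bconsole "run" command per job name given as an argument.
# The trailing "yes" skips bconsole's interactive confirmation prompt.
gen_run_commands() {
    for job in "$@"; do
        printf 'run job=%s yes\n' "$job"
    done
}

# Dry run: only print the commands that would be sent.
# To actually start the jobs: gen_run_commands $JOBS | bconsole
gen_run_commands $JOBS
```

Printing the commands first makes it easy to check the job names before piping anything into bconsole.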

Of the 12 jobs started, six completed successfully, three correctly
cancelled themselves due to being unable to establish a connection to
PostgreSQL, and three are currently still classed by the director as
"Running" though they are in the same "Fatal Error" state as usual.
One of these three cannot be cancelled as the director says:

2901 Job rmarst-desktop.2009-11-04_12.10.10_06 not found.
3904 Job rmarst-desktop.2009-11-04_12.10.10_06 not found.

The other two cause the same bconsole "hang" as seen before when I
attempt to cancel them.

After restarting the SD, one of the three jobs successfully
transitioned away from the "Running" state; the other two still could
not be cancelled, failing in the same manner as above. After
restarting the director, these jobs vanished without a trace from the
console, but their errors were logged into backup.log.

Here is an example of the error log of one of the jobs that cancelled
successfully:

04-Nov 12:10 bksrv0-dir JobId 327: Start Backup JobId 327,
Job=graham-desktop.2009-11-04_12.10.12_20
04-Nov 12:10 bksrv0-dir JobId 327: Using Device "graham-desktop"
04-Nov 12:10 bksrv0-sd JobId 327: Volume "graham-desktop-0099"
previously written, moving to end of data.
04-Nov 12:10 bksrv0-sd JobId 327: Ready to append to end of Volume
"graham-desktop-0099" size=7033783439
04-Nov 12:11 bksrv0-dir JobId 327: Fatal error: sql.c:748 sql.c:747
Could not open database "bacula": ERR=postgresql.c:234 Unable to
connect to PostgreSQL server.
Database=bacula User=bacula
It is probably not running or your password is incorrect.
04-Nov 12:11 bksrv0-dir JobId 327: Fatal error: catreq.c:488 Attribute
create error. Query failed: DROP TABLE DelCandidates: ERR=ERROR:
table "delcandidates" does not exist
04-Nov 12:11 bksrv0-sd JobId 327: Job
graham-desktop.2009-11-04_12.10.12_20 marked to be canceled.
04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: fd_cmds.c:177 FD
command not found: 112 1 0
04-Nov 12:11 bksrv0-sd JobId 327: Job write elapsed time = 00:01:35,
Transfer rate = 348.2 K bytes/second
04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: append.c:292 Fatal
append error on device "graham-desktop"
(/backup/volumes/graham-desktop/): ERR=
04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: fd_cmds.c:166 Command
error with FD, hanging up. Append data error.
04-Nov 12:11 graham-desktop JobId 327: Fatal error: backup.c:964
Network send error to SD. ERR=Connection reset by peer

Each of the three other jobs had different error messages caused by
the restart of the storage daemon:

04-Nov 13:27 richard-desktop JobId 325: Fatal error: backup.c:1108
Network send error to SD. ERR=Input/output error
04-Nov 13:47 bksrv0-dir JobId 325: Fatal error: bsock.c:488 Packet
size too big from "Storage daemon:bksrv0:9103. Terminating connection.

04-Nov 13:23 norman-desktop JobId 322: Fatal error: backup.c:964
Network send error to SD. ERR=Input/output error
04-Nov 13:47 bksrv0-sd JobId 322: Fatal error: append.c:243 Network
error on data channel. ERR=Connection reset by peer
04-Nov 13:47 bksrv0-sd JobId 322: Job write elapsed time = 01:36:49,
Transfer rate = 37  bytes/second
04-Nov 13:47 bksrv0-sd JobId 322: Fatal error: append.c:292 Fatal
append error on device "norman-desktop"
(/backup/volumes/norman-desktop/): ERR=
04-Nov 13:47 bksrv0-dir JobId 322: Error: bsock.c:518 Read error from
Storage daemon:bksrv0:9103: ERR=No data available

04-Nov 12:10 bksrv0-sd JobId 320: Job
rmarst-desktop.2009-11-04_12.10.10_06 marked to be canceled.
04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: fd_cmds.c:177 FD
command not found: 12 7 0
04-Nov 12:18 bksrv0-sd JobId 320: Job write elapsed time = 00:07:52,
Transfer rate = 3  bytes/second
04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: append.c:292 Fatal
append error on device "rmarst-desktop"
(/backup/volumes/rmarst-desktop/): ERR=
04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: fd_cmds.c:166 Command
error with FD, hanging up. Append data error.
04-Nov 12:18 rmarst-desktop JobId 320: Fatal error: backup.c:1068
Network send error to SD. ERR=Connection reset by peer

I'm not sure if all this is useful information. If there's anything
else you'd like me to try to help narrow down what's going on, just
let me know!

--Alex
