Hello Alex,

On Wednesday 04 November 2009 15:01:45 Alex Bramley wrote:
> Hi again,
>
> I forced the errors we've been chasing to re-occur with my patched-up
> bacula-3.0.3 install, by reducing PostgreSQL's maximum connections to
> 4 and running 12 backup jobs simultaneously-ish (started with a bash
> for loop piped into bconsole). I can confirm that the PostgreSQL error
> is being logged correctly now, but I'm not 100% sure it's being
> handled correctly.
>
> Of the 12 jobs started, 6 completed successfully, three correctly
> cancelled themselves due to being unable to establish a connection to
> PostgreSQL, and three are currently still classed by the director as
> "Running" though they are in the same "Fatal Error" state as usual.
> One of these three cannot be cancelled as the director says:
>
> 2901 Job rmarst-desktop.2009-11-04_12.10.10_06 not found.
> 3904 Job rmarst-desktop.2009-11-04_12.10.10_06 not found.
>
> The other two cause the same bconsole "hang" as seen before when I
> attempt to cancel them.
>
> After restarting the SD, one of the three jobs successfully
> transitioned away from the "Running" state, the other two cannot be
> cancelled in the same manner as above. After restarting the director,
> these jobs vanished without a trace from the console, but their errors
> were logged into backup.log.
>
> Here is an example of the error log of one of the jobs that cancelled
> successfully:
>
> 04-Nov 12:10 bksrv0-dir JobId 327: Start Backup JobId 327,
> Job=graham-desktop.2009-11-04_12.10.12_20
> 04-Nov 12:10 bksrv0-dir JobId 327: Using Device "graham-desktop"
> 04-Nov 12:10 bksrv0-sd JobId 327: Volume "graham-desktop-0099"
> previously written, moving to end of data.
> 04-Nov 12:10 bksrv0-sd JobId 327: Ready to append to end of Volume
> "graham-desktop-0099" size=7033783439
> 04-Nov 12:11 bksrv0-dir JobId 327: Fatal error: sql.c:748 sql.c:747
> Could not open database "bacula": ERR=postgresql.c:234 Unable to
> connect to PostgreSQL server.
> Database=bacula User=bacula
> It is probably not running or your password is incorrect.
> 04-Nov 12:11 bksrv0-dir JobId 327: Fatal error: catreq.c:488 Attribute
> create error. Query failed: DROP TABLE DelCandidates: ERR=ERROR:
> table "delcandidates" does not exist
> 04-Nov 12:11 bksrv0-sd JobId 327: Job
> graham-desktop.2009-11-04_12.10.12_20 marked to be canceled.
> 04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: fd_cmds.c:177 FD
> command not found: 112 1 0
> 04-Nov 12:11 bksrv0-sd JobId 327: Job write elapsed time = 00:01:35,
> Transfer rate = 348.2 K bytes/second
> 04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: append.c:292 Fatal
> append error on device "graham-desktop"
> (/backup/volumes/graham-desktop/): ERR=
> 04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: fd_cmds.c:166 Command
> error with FD, hanging up. Append data error.
> 04-Nov 12:11 graham-desktop JobId 327: Fatal error: backup.c:964
> Network send error to SD. ERR=Connection reset by peer
>
> Each of the three other jobs had different error messages caused by
> the restart of the storage daemon:
>
> 04-Nov 13:27 richard-desktop JobId 325: Fatal error: backup.c:1108
> Network send error to SD. ERR=Input/output error
> 04-Nov 13:47 bksrv0-dir JobId 325: Fatal error: bsock.c:488 Packet
> size too big from "Storage daemon:bksrv0:9103. Terminating connection.
>
> 04-Nov 13:23 norman-desktop JobId 322: Fatal error: backup.c:964
> Network send error to SD. ERR=Input/output error
> 04-Nov 13:47 bksrv0-sd JobId 322: Fatal error: append.c:243 Network
> error on data channel. ERR=Connection reset by peer
> 04-Nov 13:47 bksrv0-sd JobId 322: Job write elapsed time = 01:36:49,
> Transfer rate = 37  bytes/second
> 04-Nov 13:47 bksrv0-sd JobId 322: Fatal error: append.c:292 Fatal
> append error on device "norman-desktop"
> (/backup/volumes/norman-desktop/): ERR=
> 04-Nov 13:47 bksrv0-dir JobId 322: Error: bsock.c:518 Read error from
> Storage daemon:bksrv0:9103: ERR=No data available
>
> 04-Nov 12:10 bksrv0-sd JobId 320: Job
> rmarst-desktop.2009-11-04_12.10.10_06 marked to be canceled.
> 04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: fd_cmds.c:177 FD
> command not found: 12 7 0
> 04-Nov 12:18 bksrv0-sd JobId 320: Job write elapsed time = 00:07:52,
> Transfer rate = 3  bytes/second
> 04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: append.c:292 Fatal
> append error on device "rmarst-desktop"
> (/backup/volumes/rmarst-desktop/): ERR=
> 04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: fd_cmds.c:166 Command
> error with FD, hanging up. Append data error.
> 04-Nov 12:18 rmarst-desktop JobId 320: Fatal error: backup.c:1068
> Network send error to SD. ERR=Connection reset by peer
>
> I'm not sure if all this is useful information. If there's anything
> else you'd like me to try to help narrow down what's going on, just
> let me know!

Yes, this is very useful.  It is not often that I am able to see a series of 
cascading errors generated by a real "database" error, so it gave me a chance 
to see how many times an error message is repeated, and where it gets 
distorted because the job thread must continue to the end but avoid trying to 
do anything that will cause another "false" error message.

I think I have cleaned up a good part of these error messages, but what is 
worrying me is that you say that Jobs still got stuck in the SD.  So, what 
would be most useful would be for you to tell me the *exact*  PostgreSQL 
config statement (including where it is) that I must change to invoke this 
error for documentation purposes.  I am going to add debug code to Bacula to 
force the error by allowing a maximum of 2 connections and trying to start 10 
jobs.  If I can duplicate the jobs getting "stuck", I can probably completely 
resolve it.
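
To make the idea concrete, here is a rough sketch of that kind of forced 
failure.  It is not Bacula code: FakeCatalog, ConnectionLimitExceeded and 
run_jobs are hypothetical names, and the "catalog" is only a counter with a 
hard cap, assumed to behave like a database server that refuses connections 
beyond its limit.

```python
import threading


class ConnectionLimitExceeded(Exception):
    """Raised by the fake catalog when its connection cap is reached."""


class FakeCatalog:
    """Hypothetical stand-in for a database server with a hard connection cap."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def connect(self):
        # Refuse immediately instead of queueing when the cap is reached.
        if not self._slots.acquire(blocking=False):
            raise ConnectionLimitExceeded("too many connections")

    def disconnect(self):
        self._slots.release()


def run_jobs(job_count, max_connections):
    """Start job_count jobs at once against a catalog capped at max_connections."""
    catalog = FakeCatalog(max_connections)
    all_tried = threading.Barrier(job_count)
    results = {}

    def job(job_id):
        try:
            catalog.connect()
            connected = True
        except ConnectionLimitExceeded:
            connected = False
        # Hold any connection until every job has made its attempt, so the
        # number of failures does not depend on thread scheduling.
        all_tried.wait()
        if connected:
            catalog.disconnect()
        results[job_id] = "ok" if connected else "fatal"

    threads = [threading.Thread(target=job, args=(i,)) for i in range(job_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    outcome = run_jobs(job_count=10, max_connections=2)
    for job_id in sorted(outcome):
        print(job_id, outcome[job_id])
```

The interesting part for the real daemons is what each "fatal" job does next: 
it has to run to the end and release everything it holds rather than stay 
"Running".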

I'll be submitting some more patches to clean the error handling up a lot 
more, but I wouldn't recommend at this point that you attempt to take them. 

I recommend sticking with what you have: either go back to 3.0.3 or, 
preferably, test the patch carefully before putting it into production. 

If I find out how to "unstick" the stuck jobs, I will let you know.

Regards,

Kern

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel