Hello Alex,

On Wednesday 04 November 2009 15:01:45 Alex Bramley wrote:
> Hi again,
>
> I forced the errors we've been chasing to re-occur with my patched-up
> bacula-3.0.3 install, by reducing PostgreSQL's maximum connections to
> 4 and running 12 backup jobs simultaneously-ish (started with a bash
> for loop piped into bconsole). I can confirm that the PostgreSQL error
> is being logged correctly now, but I'm not 100% sure it's being
> handled correctly.
>
> Of the 12 jobs started, 6 completed successfully, three correctly
> cancelled themselves due to being unable to establish a connection to
> PostgreSQL, and three are currently still classed by the director as
> "Running" though they are in the same "Fatal Error" state as usual.
> One of these three cannot be cancelled as the director says:
>
> 2901 Job rmarst-desktop.2009-11-04_12.10.10_06 not found.
> 3904 Job rmarst-desktop.2009-11-04_12.10.10_06 not found.
>
> The other two cause the same bconsole "hang" as seen before when I
> attempt to cancel them.
>
> After restarting the SD, one of the three jobs successfully
> transitioned away from the "Running" state, the other two cannot be
> cancelled in the same manner as above. After restarting the director,
> these jobs vanished without a trace from the console, but their errors
> were logged into backup.log.
>
> Here is an example of the error log of one of the jobs that cancelled
> successfully:
>
> 04-Nov 12:10 bksrv0-dir JobId 327: Start Backup JobId 327,
> Job=graham-desktop.2009-11-04_12.10.12_20
> 04-Nov 12:10 bksrv0-dir JobId 327: Using Device "graham-desktop"
> 04-Nov 12:10 bksrv0-sd JobId 327: Volume "graham-desktop-0099"
> previously written, moving to end of data.
> 04-Nov 12:10 bksrv0-sd JobId 327: Ready to append to end of Volume
> "graham-desktop-0099" size=7033783439
> 04-Nov 12:11 bksrv0-dir JobId 327: Fatal error: sql.c:748 sql.c:747
> Could not open database "bacula": ERR=postgresql.c:234 Unable to
> connect to PostgreSQL server.
> Database=bacula User=bacula
> It is probably not running or your password is incorrect.
> 04-Nov 12:11 bksrv0-dir JobId 327: Fatal error: catreq.c:488 Attribute
> create error. Query failed: DROP TABLE DelCandidates: ERR=ERROR:
> table "delcandidates" does not exist
> 04-Nov 12:11 bksrv0-sd JobId 327: Job
> graham-desktop.2009-11-04_12.10.12_20 marked to be canceled.
> 04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: fd_cmds.c:177 FD
> command not found: 112 1 0
> 04-Nov 12:11 bksrv0-sd JobId 327: Job write elapsed time = 00:01:35,
> Transfer rate = 348.2 K bytes/second
> 04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: append.c:292 Fatal
> append error on device "graham-desktop"
> (/backup/volumes/graham-desktop/): ERR=
> 04-Nov 12:11 bksrv0-sd JobId 327: Fatal error: fd_cmds.c:166 Command
> error with FD, hanging up. Append data error.
> 04-Nov 12:11 graham-desktop JobId 327: Fatal error: backup.c:964
> Network send error to SD. ERR=Connection reset by peer
>
> Each of the three other jobs had different error messages caused by
> the restart of the storage daemon:
>
> 04-Nov 13:27 richard-desktop JobId 325: Fatal error: backup.c:1108
> Network send error to SD. ERR=Input/output error
> 04-Nov 13:47 bksrv0-dir JobId 325: Fatal error: bsock.c:488 Packet
> size too big from "Storage daemon:bksrv0:9103. Terminating connection.
>
> 04-Nov 13:23 norman-desktop JobId 322: Fatal error: backup.c:964
> Network send error to SD. ERR=Input/output error
> 04-Nov 13:47 bksrv0-sd JobId 322: Fatal error: append.c:243 Network
> error on data channel. ERR=Connection reset by peer
> 04-Nov 13:47 bksrv0-sd JobId 322: Job write elapsed time = 01:36:49,
> Transfer rate = 37 bytes/second
> 04-Nov 13:47 bksrv0-sd JobId 322: Fatal error: append.c:292 Fatal
> append error on device "norman-desktop"
> (/backup/volumes/norman-desktop/): ERR=
> 04-Nov 13:47 bksrv0-dir JobId 322: Error: bsock.c:518 Read error from
> Storage daemon:bksrv0:9103: ERR=No data available
>
> 04-Nov 12:10 bksrv0-sd JobId 320: Job
> rmarst-desktop.2009-11-04_12.10.10_06 marked to be canceled.
> 04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: fd_cmds.c:177 FD
> command not found: 12 7 0
> 04-Nov 12:18 bksrv0-sd JobId 320: Job write elapsed time = 00:07:52,
> Transfer rate = 3 bytes/second
> 04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: append.c:292 Fatal
> append error on device "rmarst-desktop"
> (/backup/volumes/rmarst-desktop/): ERR=
> 04-Nov 12:18 bksrv0-sd JobId 320: Fatal error: fd_cmds.c:166 Command
> error with FD, hanging up. Append data error.
> 04-Nov 12:18 rmarst-desktop JobId 320: Fatal error: backup.c:1068
> Network send error to SD. ERR=Connection reset by peer
>
> I'm not sure if all this is useful information. If there's anything
> else you'd like me to try to help narrow down what's going on, just
> let me know!

Yes, this is very useful. It is not often that I am able to see a series of
cascading errors generated by a real "database" error, so it gave me a chance
to see how many times an error message is repeated, and where it gets
distorted because the job thread must continue to the end but avoid trying to
do anything that will cause another "false" error message.

I think I have cleaned up a good part of these error messages, but what is
worrying me is that you say that Jobs still got stuck in the SD.

So, what would be most useful would be for you to tell me the *exact*
PostgreSQL config statement (including where it is) that I must change to
invoke this error, for documentation purposes. I am going to add debug code
to Bacula to force the error by allowing a maximum of 2 connections and
trying to start 10 jobs. If I can duplicate the jobs getting "stuck", I can
probably completely resolve it.

I'll be submitting some more patches to clean up the error handling a lot
more, but I wouldn't recommend at this point that you attempt to take them.
I recommend sticking with what you have, and either going back to 3.0.3 or,
preferably, testing the patch carefully before putting it into production.

If I find out how to "unstick" the stuck jobs, I will let you know.

Regards,

Kern

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel