On 2020-06-26 17:58, Phil Stracchino wrote:
> Oh, another detail I learned this morning:
>
> Cancelling a SINGLE hung job does not crash the Director.  It appears it
> is only attempting to cancel a SECOND hung job that causes the Director
> to crash.

After four days of normal operation I got another hung job this morning,
still with HAproxy not in use.  So it is *much less* prevalent without
HAproxy but still occurring (one failure every 3-4 days instead of 2-3
failed jobs per day).

This is the same running Director process as handled the failure
mentioned above, and the Director was frozen and unresponsive for nearly
ten minutes attempting to cancel the second job before it finally marked
it as cancelled.  (But the job still shows as running and with a fatal
error.)

====
*cancel
Select Job(s):
     1: JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
     2: JobId=25115 Job=MySQL_Backup_New.2020-07-01_04.55.00_39
Choose Job list to cancel (1-2): 1
JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
Confirm cancel of 1 Job (yes/no): yes
Failed to connect to File daemon.
3000 JobId=25114 Job="Narn_Backup.2020-07-01_04.30.00_37" marked to be canceled.
You have messages.
*status dir
minbar-dir Version: 9.6.5 (11 June 2020) x86_64-pc-linux-gnu gentoo
Daemon started 27-Jun-20 01:59, conf reloaded 27-Jun-2020 01:59:13
 Jobs: run=34, running=2 mode=0,0
 Heap: heap=544,768 smbytes=215,764 max_bytes=372,778 bufs=914 max_bufs=1,140
 Res: njobs=11 nclients=6 nstores=2 npools=5 ncats=1 nfsets=11 nscheds=6

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name            Volume
===================================================================================
Incremental    Backup    10  02-Jul-20 04:30    Asgard Backup       *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Narn Backup         *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Minbar Backup       *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Fisherprice Backup  *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Babylon5 Backup     *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Netstore Backup     *unknown*
Incremental    Backup    15  02-Jul-20 04:55    MySQL Backup New    *unknown*
====

Running Jobs:
Console connected at 01-Jul-20 09:35
Console connected at 01-Jul-20 15:34
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
 25114  Back Incr          0         0  Narn Backup       has a fatal error
 25115  Back Incr          0         0  MySQL Backup New  is waiting for higher priority jobs to finish
====

Terminated Jobs:
 JobId  Level     Files     Bytes  Status  Finished         Name
====================================================================
 25102  Incr         35   48.84 M  OK      30-Jun-20 04:30  Asgard_Backup
 25107  Incr      1,727   1.940 G  OK      30-Jun-20 04:30  Narn_Backup
 25106  Incr      3,020   2.799 G  OK      30-Jun-20 04:30  Minbar_Backup
 25104  Incr      5,392   5.551 G  OK      30-Jun-20 04:31  Babylon5_Backup
 25108  Incr        281   2.459 G  OK      30-Jun-20 05:00  MySQL_Backup_New
 25110  Incr          0         0  OK      01-Jul-20 04:30  Netstore_Backup
 25112  Incr        237   415.4 M  OK      01-Jul-20 04:30  Fisherprice_Backup
 25109  Incr         34   48.84 M  OK      01-Jul-20 04:30  Asgard_Backup
 25113  Incr      7,072   3.757 G  OK      01-Jul-20 04:30  Minbar_Backup
 25111  Incr     10,306   5.406 G  OK      01-Jul-20 04:31  Babylon5_Backup
====

Martin, you're still convinced this has to do with the SD?  I'm still
working on trying to figure out how I can get dbx installed so that I
can backtrace the SD.  pstack isn't very informative:

asgard:root:~:1 # pstack 10971
10971:  /opt/bacula/sbin/bacula-sd -v -c /opt/bacula/etc/bacula-sd.conf
------------  lwp# 1 / thread# 1  ---------------
 ffff80ffbf580daa pollsys (ffff80ffbfffcd10, 1, 0, 0)
 ffff80ffbf525755 pselect () + 181
 ffff80ffbf525bd4 select () + 68
 ffff80ffb5211023 __1cSbnet_thread_server6FpnFdlist_ipnJworkq_tag_pFpv_4_v_ () + 963
 0000000000419c84 main () + 724
 3f763a7574735070 ???????? ()
------------  lwp# 3 / thread# 3  ---------------
 ffff80ffbf577e97 lwp_park (0, ffff80ffbe5bbe30, 0)
 ffff80ffbf5716fa cond_wait_queue () + 62
 ffff80ffbf571b38 cond_wait_common () + 1dc
 ffff80ffbf571dc7 __cond_timedwait () + a7
 ffff80ffbf571e11 cond_timedwait () + 29
 ffff80ffbf571e45 pthread_cond_timedwait () + 9
 ffff80ffb526a41a watchdog_thread () + 57a
------------  lwp# 70 / thread# 70  ---------------
 ffff80ffbf580e7a read (10, ffff80ffbd3c495c, 4)
 ffff80ffb524fdbe __1cJBSOCKCORELread_nbytes6Mpci_i_ () + 4e

--
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958