On 2020-06-26 17:58, Phil Stracchino wrote:
> Oh, another detail I learned this morning:
>
> Cancelling a SINGLE hung job does not crash the Director. It appears it
> is only attempting to cancel a SECOND hung job that causes the Director
> to crash.

After four days of normal operation I got another hung job this morning,
still with HAProxy not in use. So the problem is *much less* prevalent
without HAProxy, but it still occurs (one failure every 3-4 days instead
of 2-3 failed jobs per day). This is the same running Director process
that handled the failure mentioned above, and the Director was frozen and
unresponsive for nearly ten minutes while attempting to cancel the second
job before it finally marked it as cancelled. (But the job still shows as
running, with a fatal error.)
====
*cancel
Select Job(s):
1: JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
2: JobId=25115 Job=MySQL_Backup_New.2020-07-01_04.55.00_39
Choose Job list to cancel (1-2): 1
JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
Confirm cancel of 1 Job (yes/no): yes
Failed to connect to File daemon.
3000 JobId=25114 Job="Narn_Backup.2020-07-01_04.30.00_37" marked to be canceled.
You have messages.
*status dir
minbar-dir Version: 9.6.5 (11 June 2020) x86_64-pc-linux-gnu gentoo
Daemon started 27-Jun-20 01:59, conf reloaded 27-Jun-2020 01:59:13
Jobs: run=34, running=2 mode=0,0
Heap: heap=544,768 smbytes=215,764 max_bytes=372,778 bufs=914 max_bufs=1,140
Res: njobs=11 nclients=6 nstores=2 npools=5 ncats=1 nfsets=11 nscheds=6
Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
Incremental    Backup    10  02-Jul-20 04:30    Asgard Backup      *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Narn Backup        *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Minbar Backup      *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Fisherprice Backup *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Babylon5 Backup    *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Netstore Backup    *unknown*
Incremental    Backup    15  02-Jul-20 04:55    MySQL Backup New   *unknown*
====
Running Jobs:
Console connected at 01-Jul-20 09:35
Console connected at 01-Jul-20 15:34
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
 25114  Back Incr          0         0  Narn Backup       has a fatal error
 25115  Back Incr          0         0  MySQL Backup New  is waiting for higher priority jobs to finish
====
Terminated Jobs:
 JobId  Level    Files  Bytes     Status  Finished         Name
====================================================================
 25102  Incr        35  48.84 M   OK      30-Jun-20 04:30  Asgard_Backup
 25107  Incr     1,727  1.940 G   OK      30-Jun-20 04:30  Narn_Backup
 25106  Incr     3,020  2.799 G   OK      30-Jun-20 04:30  Minbar_Backup
 25104  Incr     5,392  5.551 G   OK      30-Jun-20 04:31  Babylon5_Backup
 25108  Incr       281  2.459 G   OK      30-Jun-20 05:00  MySQL_Backup_New
 25110  Incr         0  0         OK      01-Jul-20 04:30  Netstore_Backup
 25112  Incr       237  415.4 M   OK      01-Jul-20 04:30  Fisherprice_Backup
 25109  Incr        34  48.84 M   OK      01-Jul-20 04:30  Asgard_Backup
 25113  Incr     7,072  3.757 G   OK      01-Jul-20 04:30  Minbar_Backup
 25111  Incr    10,306  5.406 G   OK      01-Jul-20 04:31  Babylon5_Backup
====
Martin, are you still convinced this has to do with the SD? I'm still
trying to figure out how to get dbx installed so that I can backtrace
the SD. pstack isn't very informative:
asgard:root:~:1 # pstack 10971
10971: /opt/bacula/sbin/bacula-sd -v -c /opt/bacula/etc/bacula-sd.conf
------------ lwp# 1 / thread# 1 ---------------
ffff80ffbf580daa pollsys (ffff80ffbfffcd10, 1, 0, 0)
ffff80ffbf525755 pselect () + 181
ffff80ffbf525bd4 select () + 68
ffff80ffb5211023 __1cSbnet_thread_server6FpnFdlist_ipnJworkq_tag_pFpv_4_v_ () + 963
0000000000419c84 main () + 724
3f763a7574735070 ???????? ()
------------ lwp# 3 / thread# 3 ---------------
ffff80ffbf577e97 lwp_park (0, ffff80ffbe5bbe30, 0)
ffff80ffbf5716fa cond_wait_queue () + 62
ffff80ffbf571b38 cond_wait_common () + 1dc
ffff80ffbf571dc7 __cond_timedwait () + a7
ffff80ffbf571e11 cond_timedwait () + 29
ffff80ffbf571e45 pthread_cond_timedwait () + 9
ffff80ffb526a41a watchdog_thread () + 57a
------------ lwp# 70 / thread# 70 ---------------
ffff80ffbf580e7a read (10, ffff80ffbd3c495c, 4)
ffff80ffb524fdbe __1cJBSOCKCORELread_nbytes6Mpci_i_ () + 4e
--
Phil Stracchino
Babylon Communications
Landline: +1.603.293.8485
Mobile: +1.603.998.6958