On 2020-06-26 17:58, Phil Stracchino wrote:
> Oh, another detail I learned this morning:
> 
> Cancelling a SINGLE hung job does not crash the Director.  It appears it
> is only attempting to cancel a SECOND hung job that causes the Director
> to crash.


After four days of normal operation I got another hung job this morning,
still with HAproxy not in use.  So the problem is *much less* frequent
without HAproxy, but it still occurs (one failure every 3-4 days instead
of 2-3 failed jobs per day).  This is the same running Director process
that handled the failure mentioned above.  The Director was frozen and
unresponsive for nearly ten minutes while attempting to cancel the second
hung job before it finally marked it as cancelled.  (But the job still
shows as running, with a fatal error.)


====
*cancel
Select Job(s):
     1: JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
     2: JobId=25115 Job=MySQL_Backup_New.2020-07-01_04.55.00_39
Choose Job list to cancel (1-2): 1
JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
Confirm cancel of 1 Job (yes/no): yes
Failed to connect to File daemon.
3000 JobId=25114 Job="Narn_Backup.2020-07-01_04.30.00_37" marked to be canceled.
You have messages.
*status dir
minbar-dir Version: 9.6.5 (11 June 2020) x86_64-pc-linux-gnu gentoo
Daemon started 27-Jun-20 01:59, conf reloaded 27-Jun-2020 01:59:13
 Jobs: run=34, running=2 mode=0,0
 Heap: heap=544,768 smbytes=215,764 max_bytes=372,778 bufs=914 max_bufs=1,140
 Res: njobs=11 nclients=6 nstores=2 npools=5 ncats=1 nfsets=11 nscheds=6

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
Incremental    Backup    10  02-Jul-20 04:30    Asgard Backup      *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Narn Backup        *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Minbar Backup      *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Fisherprice Backup *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Babylon5 Backup    *unknown*
Incremental    Backup    10  02-Jul-20 04:30    Netstore Backup    *unknown*
Incremental    Backup    15  02-Jul-20 04:55    MySQL Backup New   *unknown*
====

Running Jobs:
Console connected at 01-Jul-20 09:35
Console connected at 01-Jul-20 15:34
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
 25114  Back Incr          0         0  Narn Backup       has a fatal error
 25115  Back Incr          0         0  MySQL Backup New  is waiting for higher priority jobs to finish
====

Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name
====================================================================
 25102  Incr          35    48.84 M  OK       30-Jun-20 04:30 Asgard_Backup
 25107  Incr       1,727    1.940 G  OK       30-Jun-20 04:30 Narn_Backup
 25106  Incr       3,020    2.799 G  OK       30-Jun-20 04:30 Minbar_Backup
 25104  Incr       5,392    5.551 G  OK       30-Jun-20 04:31 Babylon5_Backup
 25108  Incr         281    2.459 G  OK       30-Jun-20 05:00 MySQL_Backup_New
 25110  Incr           0         0   OK       01-Jul-20 04:30 Netstore_Backup
 25112  Incr         237    415.4 M  OK       01-Jul-20 04:30 Fisherprice_Backup
 25109  Incr          34    48.84 M  OK       01-Jul-20 04:30 Asgard_Backup
 25113  Incr       7,072    3.757 G  OK       01-Jul-20 04:30 Minbar_Backup
 25111  Incr      10,306    5.406 G  OK       01-Jul-20 04:31 Babylon5_Backup

====


Martin, you're still convinced this has to do with the SD?  I'm still
working on trying to figure out how I can get dbx installed so that I
can backtrace the SD.  pstack isn't very informative:

asgard:root:~:1 # pstack 10971
10971:  /opt/bacula/sbin/bacula-sd -v -c /opt/bacula/etc/bacula-sd.conf
------------  lwp# 1 / thread# 1  ---------------
 ffff80ffbf580daa pollsys  (ffff80ffbfffcd10, 1, 0, 0)
 ffff80ffbf525755 pselect () + 181
 ffff80ffbf525bd4 select () + 68
 ffff80ffb5211023 __1cSbnet_thread_server6FpnFdlist_ipnJworkq_tag_pFpv_4_v_ () + 963
 0000000000419c84 main () + 724
 3f763a7574735070 ???????? ()
------------  lwp# 3 / thread# 3  ---------------
 ffff80ffbf577e97 lwp_park (0, ffff80ffbe5bbe30, 0)
 ffff80ffbf5716fa cond_wait_queue () + 62
 ffff80ffbf571b38 cond_wait_common () + 1dc
 ffff80ffbf571dc7 __cond_timedwait () + a7
 ffff80ffbf571e11 cond_timedwait () + 29
 ffff80ffbf571e45 pthread_cond_timedwait () + 9
 ffff80ffb526a41a watchdog_thread () + 57a
------------  lwp# 70 / thread# 70  ---------------
 ffff80ffbf580e7a read     (10, ffff80ffbd3c495c, 4)
 ffff80ffb524fdbe __1cJBSOCKCORELread_nbytes6Mpci_i_ () + 4e
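
Part of why the pstack output is hard to read is that the C++ symbols
are still mangled.  As a rough aid, here is a minimal Python sketch that
pulls just the qualified function name out of symbols like the ones
above.  It assumes (inferred only from the three mangled names in this
trace, not from any ABI document) that after the "__1c" prefix each name
component is preceded by an uppercase letter giving its length (A=0,
B=1, ...) and that "6" ends the qualified name.  The helper name
sun_qualified_name is made up for illustration; argument types are not
decoded, and anything that does not fit the pattern is returned
unchanged.

```python
def sun_qualified_name(symbol):
    """Best-effort extraction of the qualified name from a mangled symbol.

    Heuristic only: assumes "__1c" + (length letter + name)+ + "6" + rest,
    where the length letter encodes A=0, B=1, ... (so J=9, L=11, S=18).
    """
    prefix = "__1c"
    if not symbol.startswith(prefix):
        return symbol  # not a symbol of the assumed form

    parts = []
    i = len(prefix)
    # Consume length-prefixed name components until a non-uppercase char.
    while i < len(symbol) and symbol[i].isupper():
        length = ord(symbol[i]) - ord("A")
        name = symbol[i + 1 : i + 1 + length]
        if len(name) != length:
            return symbol  # truncated symbol, give up
        parts.append(name)
        i += 1 + length

    # The qualified name is expected to be terminated by "6".
    if not parts or i >= len(symbol) or symbol[i] != "6":
        return symbol
    return "::".join(parts)


if __name__ == "__main__":
    for sym in (
        "__1cSbnet_thread_server6FpnFdlist_ipnJworkq_tag_pFpv_4_v_",
        "__1cJBSOCKCORELread_nbytes6Mpci_i_",
        "watchdog_thread",
    ):
        print(sun_qualified_name(sym))
```

Read that way, lwp# 1 is sitting in select() inside bnet_thread_server,
and lwp# 70 is blocked in read() called from BSOCKCORE::read_nbytes.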






-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
