>>>>> On Thu, 2 Jul 2020 14:25:49 -0400, Phil Stracchino said:
>
> On 2020-07-02 14:10, Martin Simmons wrote:
> >>>>>> On Wed, 1 Jul 2020 15:54:27 -0400, Phil Stracchino said:
> >>
> >> On 2020-06-26 17:58, Phil Stracchino wrote:
> >>> Oh, another detail I learned this morning:
> >>>
> >>> Cancelling a SINGLE hung job does not crash the Director. It appears it
> >>> is only attempting to cancel a SECOND hung job that causes the Director
> >>> to crash.
> >>
> >>
> >> After four days of normal operation I got another hung job this morning,
> >> still with HAproxy not in use. So it is *much less* prevalent without
> >> HAproxy but still occurring (one failure every 3-4 days instead of 2-3
> >> failed jobs per day). This is the same running Director process as
> >> handled the failure mentioned above, and the Director was frozen and
> >> unresponsive for nearly ten minutes attempting to cancel the second job
> >> before it finally marked it as cancelled. (But the job still shows as
> >> running and with a fatal error.)
> >
> > I'm confused how it can be the same Director process. Didn't that Director
> > crash when you attempted to cancel the second job?
>
>
> No, I had a single hung job which was successfully cancelled (though it
> took a long time), and then no more failures until yesterday when there
> was again a single hung job (which also took a long time to cancel but
> did eventually successfully cancel). It seems as long as I don't try to
> cancel two hung jobs at once, there is no crash.

Ah, OK.

>
>
> >
> >
> >> ====
> >> *cancel
> >> Select Job(s):
> >> 1: JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
> >> 2: JobId=25115 Job=MySQL_Backup_New.2020-07-01_04.55.00_39
> >> Choose Job list to cancel (1-2): 1
> >> JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
> >> Confirm cancel of 1 Job (yes/no): yes
> >> Failed to connect to File daemon.
> >
> > This failure to connect is strange.
> >
> >
> >> Martin, you're still convinced this has to do with the SD?
> >
> > I think the hanging jobs are caused by a problem with how Bacula recovers
> > from an SQL error while inserting attributes into the database. This is
> > not a problem with the SD itself, but the hanging does involve the SD so
> > that's why it would be useful to get simultaneous backtraces from all 3
> > daemons while they are hanging.
> >
> >
> >> I'm still
> >> working on trying to figure out how I can get dbx installed so that I
> >> can backtrace the SD. pstack isn't very informative:
> >>
> >> asgard:root:~:1 # pstack 10971
> >> 10971: /opt/bacula/sbin/bacula-sd -v -c /opt/bacula/etc/bacula-sd.conf
> >> ------------ lwp# 1 / thread# 1 ---------------
> >> ffff80ffbf580daa pollsys (ffff80ffbfffcd10, 1, 0, 0)
> >> ffff80ffbf525755 pselect () + 181
> >> ffff80ffbf525bd4 select () + 68
> >> ffff80ffb5211023 __1cSbnet_thread_server6FpnFdlist_ipnJworkq_tag_pFpv_4_v_ () + 963
> >> 0000000000419c84 main () + 724
> >> 3f763a7574735070 ???????? ()
> >> ------------ lwp# 3 / thread# 3 ---------------
> >> ffff80ffbf577e97 lwp_park (0, ffff80ffbe5bbe30, 0)
> >> ffff80ffbf5716fa cond_wait_queue () + 62
> >> ffff80ffbf571b38 cond_wait_common () + 1dc
> >> ffff80ffbf571dc7 __cond_timedwait () + a7
> >> ffff80ffbf571e11 cond_timedwait () + 29
> >> ffff80ffbf571e45 pthread_cond_timedwait () + 9
> >> ffff80ffb526a41a watchdog_thread () + 57a
> >> ------------ lwp# 70 / thread# 70 ---------------
> >> ffff80ffbf580e7a read (10, ffff80ffbd3c495c, 4)
> >> ffff80ffb524fdbe __1cJBSOCKCORELread_nbytes6Mpci_i_ () + 4e
> >
> > So it just stops there? That's annoying.
> >
> > Which compiler and options were used to compile on Solaris? What is the
> > setting for CFLAGS in the generated bacula/src/stored/Makefile?
>
>
> Sun Developer Studio 12.6
>
> Configure invocation:
>
> ./configure --prefix=/opt/bacula --with-dump-email=r...@caerllewys.net
> --with-job-email=r...@caerllewys.net
> --with-smtp-host=smtp.caerllewys.net --with-subsys-dir=/opt/bacula/var
> --with-working-dir=/opt/bacula/var --enable-build-stored
> --disable-build-dird --enable-smartalloc --with-mysql=/opt/mysql/mysql
> CC=/opt/suncc/bin/CC CFLAGS='-fast -xarch=generic -xtarget=generic
> -xcache=generic -m64 -g' CPPFLAGS='-fast -xarch=generic -xtarget=generic
> -xcache=generic -m64 -g' CXX=/opt/suncc/bin/CC CXXFLAGS='-march=native
> -mfpmath=sse -pipe -m64 -g' LDFLAGS="-m64"
>
> Which, as expected, yields:
>
> asgard:root:/netstore/src/bacula-9.6.5:11 # grep CFLAGS src/stored/Makefile
> CFLAGS = -fast -xarch=generic -xtarget=generic -xcache=generic -m64 -g

I don't have any experience of this compiler, but maybe removing -fast will
improve the pstack output? Also, the documentation mentions -xkeepframe=%all
which might help.

__Martin

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
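
For illustration only, the suggested flag change could be sketched as below. This is an untested assumption, not a verified fix: it starts from the CFLAGS value shown in the grep output above, drops -fast, and appends -xkeepframe=%all. The variable names CFLAGS_OLD and CFLAGS_NEW are hypothetical and exist only for this sketch.

```shell
#!/bin/sh
# CFLAGS value copied from the grep output quoted in the message.
CFLAGS_OLD='-fast -xarch=generic -xtarget=generic -xcache=generic -m64 -g'

# Drop the leading -fast option, then append -xkeepframe=%all.
# Whether this actually improves the pstack output is unverified.
CFLAGS_NEW=$(printf '%s\n' "$CFLAGS_OLD" | sed 's/^-fast  *//')
CFLAGS_NEW="$CFLAGS_NEW -xkeepframe=%all"

# Print the adjusted flag string for inspection.
printf '%s\n' "$CFLAGS_NEW"
```

The resulting string would presumably replace the CFLAGS='...' (and matching CPPFLAGS='...') arguments when re-running ./configure, with the other options left as they were.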