>>>>> On Thu, 2 Jul 2020 14:25:49 -0400, Phil Stracchino said:
> 
> On 2020-07-02 14:10, Martin Simmons wrote:
> >>>>>> On Wed, 1 Jul 2020 15:54:27 -0400, Phil Stracchino said:
> >>
> >> On 2020-06-26 17:58, Phil Stracchino wrote:
> >>> Oh, another detail I learned this morning:
> >>>
> >>> Cancelling a SINGLE hung job does not crash the Director.  It appears it
> >>> is only attempting to cancel a SECOND hung job that causes the Director
> >>> to crash.
> >>
> >>
> >> After four days of normal operation I got another hung job this morning,
> >> still with HAproxy not in use.  So it is *much less* prevalent without
> >> HAproxy but still occurring (one failure every 3-4 days instead of 2-3
> >> failed jobs per day).  This is the same running Director process as
> >> handled the failure mentioned above, and the Director was frozen and
> >> unresponsive for nearly ten minutes attempting to cancel the second job
> >> before it finally marked it as cancelled.  (But the job still shows as
> >> running and with a fatal error.)
> > 
> > I'm confused how it can be the same Director process.  Didn't that Director
> > crash when you attempted to cancel the second job?
> 
> 
> No, I had a single hung job which was successfully cancelled (though it
> took a long time), and then no more failures until yesterday when there
> was again a single hung job (which also took a long time to cancel but
> did eventually successfully cancel).  It seems as long as I don't try to
> cancel two hung jobs at once, there is no crash.

Ah, OK.

> 
> 
> > 
> > 
> >> ====
> >> *cancel
> >> Select Job(s):
> >>      1: JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
> >>      2: JobId=25115 Job=MySQL_Backup_New.2020-07-01_04.55.00_39
> >> Choose Job list to cancel (1-2): 1
> >> JobId=25114 Job=Narn_Backup.2020-07-01_04.30.00_37
> >> Confirm cancel of 1 Job (yes/no): yes
> >> Failed to connect to File daemon.
> > 
> > This failure to connect is strange.
> > 
> > 
> >> Martin, you're still convinced this has to do with the SD?
> > 
> > I think the hanging jobs are caused by a problem with how Bacula recovers from
> > an SQL error while inserting attributes into the database.  This is not a
> > problem with the SD itself, but the hanging does involve the SD so that's why
> > it would be useful to get simultaneous backtraces from all 3 daemons while
> > they are hanging.
> > 
> > 
> >>                                                             I'm still
> >> working on trying to figure out how I can get dbx installed so that I
> >> can backtrace the SD.  pstack isn't very informative:
> >>
> >> asgard:root:~:1 # pstack 10971
> >> 10971:  /opt/bacula/sbin/bacula-sd -v -c /opt/bacula/etc/bacula-sd.conf
> >> ------------  lwp# 1 / thread# 1  ---------------
> >>  ffff80ffbf580daa pollsys  (ffff80ffbfffcd10, 1, 0, 0)
> >>  ffff80ffbf525755 pselect () + 181
> >>  ffff80ffbf525bd4 select () + 68
> >>  ffff80ffb5211023 __1cSbnet_thread_server6FpnFdlist_ipnJworkq_tag_pFpv_4_v_ () + 963
> >>  0000000000419c84 main () + 724
> >>  3f763a7574735070 ???????? ()
> >> ------------  lwp# 3 / thread# 3  ---------------
> >>  ffff80ffbf577e97 lwp_park (0, ffff80ffbe5bbe30, 0)
> >>  ffff80ffbf5716fa cond_wait_queue () + 62
> >>  ffff80ffbf571b38 cond_wait_common () + 1dc
> >>  ffff80ffbf571dc7 __cond_timedwait () + a7
> >>  ffff80ffbf571e11 cond_timedwait () + 29
> >>  ffff80ffbf571e45 pthread_cond_timedwait () + 9
> >>  ffff80ffb526a41a watchdog_thread () + 57a
> >> ------------  lwp# 70 / thread# 70  ---------------
> >>  ffff80ffbf580e7a read     (10, ffff80ffbd3c495c, 4)
> >>  ffff80ffb524fdbe __1cJBSOCKCORELread_nbytes6Mpci_i_ () + 4e
> > 
> > So it just stops there?  That's annoying.
> > 
> > Which compiler and options were used to compile on Solaris?  What is the
> > setting for CFLAGS in the generated bacula/src/stored/Makefile?
> 
> 
> Sun Developer Studio 12.6
> 
> Configure invocation:
> 
> ./configure --prefix=/opt/bacula --with-dump-email=r...@caerllewys.net
> --with-job-email=r...@caerllewys.net
> --with-smtp-host=smtp.caerllewys.net --with-subsys-dir=/opt/bacula/var
> --with-working-dir=/opt/bacula/var --enable-build-stored
> --disable-build-dird --enable-smartalloc --with-mysql=/opt/mysql/mysql
> CC=/opt/suncc/bin/CC CFLAGS='-fast -xarch=generic -xtarget=generic
> -xcache=generic -m64 -g' CPPFLAGS='-fast -xarch=generic -xtarget=generic
> -xcache=generic -m64 -g' CXX=/opt/suncc/bin/CC CXXFLAGS='-march=native
> -mfpmath=sse -pipe -m64 -g' LDFLAGS="-m64"
> 
> Which, as expected, yields:
> 
> asgard:root:/netstore/src/bacula-9.6.5:11 # grep CFLAGS src/stored/Makefile
> CFLAGS = -fast -xarch=generic -xtarget=generic -xcache=generic -m64 -g

I don't have any experience of this compiler, but maybe removing -fast will
improve the pstack output?  Also, the documentation mentions -xkeepframe=%all
which might help.
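
Something like the following might be a starting point.  I haven't tried it
with Developer Studio, so treat it as a sketch rather than a known-good
recipe.  It only derives a new CFLAGS string from the one your Makefile
reports, dropping -fast and appending -xkeepframe=%all.  The variable names
OLD_CFLAGS and NEW_CFLAGS are just mine for illustration.

```shell
# CFLAGS as currently reported by: grep CFLAGS src/stored/Makefile
OLD_CFLAGS='-fast -xarch=generic -xtarget=generic -xcache=generic -m64 -g'

# Copy every flag except -fast, then append -xkeepframe=%all.
NEW_CFLAGS=
for flag in $OLD_CFLAGS; do
    case "$flag" in
        -fast) ;;   # skip the -fast optimization macro
        *) NEW_CFLAGS="${NEW_CFLAGS:+$NEW_CFLAGS }$flag" ;;
    esac
done
NEW_CFLAGS="$NEW_CFLAGS -xkeepframe=%all"

# Print the value to pass to ./configure as CFLAGS=... before rebuilding.
printf "CFLAGS='%s'\n" "$NEW_CFLAGS"
```

You would then rerun your configure invocation with that CFLAGS value (your
CPPFLAGS currently has the same contents, so it probably wants the same
change), rebuild the SD, and see whether pstack shows complete stacks.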

__Martin


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
