Since you have a known-good release and a known-bad release, you could try
git bisect to identify (or at least narrow down) which commit introduced the
problem.  I can't remember which system you said the Director runs on, but
are you compiling the Director yourself, or at least able to?  How many days
of running do you think it takes to say definitively whether the bug is
present or absent in a given configuration?  Do your distro's patches (if
any) differ between 9.6.3 and 9.6.5?
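
In case the workflow is unfamiliar, here is a rough, self-contained sketch of
it on a throwaway repository.  Everything in it (the tag names good-release
and bad-release, the status.txt file, the simulated "hang") is made up for
illustration and has nothing to do with the real Bacula tree.

```shell
#!/bin/sh
# Self-contained demo of the bisect workflow in a throwaway repository.
# Nothing here touches a real Bacula tree; all names below are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.invalid"
git config user.name "Bisect Demo"
git config commit.gpgsign false

# Eight commits; the fifth one introduces a simulated regression.
echo ok > status.txt
for i in 1 2 3 4 5 6 7 8; do
    echo "change $i" >> history.txt
    if [ "$i" -eq 5 ]; then echo hang > status.txt; fi
    git add -A
    git commit -q -m "commit $i"
    if [ "$i" -eq 1 ]; then git tag good-release; fi
done
git tag bad-release

# Mark the endpoints, then let git pick the midpoints.
git bisect start >/dev/null
git bisect bad bad-release >/dev/null
git bisect good good-release >/dev/null

# Automated verdict: exit 0 means good, non-zero means bad.
git bisect run grep -q ok status.txt >/dev/null

# When bisect finishes, refs/bisect/bad points at the first bad commit.
culprit=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $culprit"
git bisect reset >/dev/null 2>&1
```

On the real tree you would use the same start/bad/good commands with whatever
the 9.6.5 and 9.6.3 release tags are called (check "git tag"; I have not
verified the names).  Since each verdict needs a build plus days of soak time,
"git bisect run" would not apply; instead you would run "git bisect good" or
"git bisect bad" by hand after each test, and "git bisect skip" for a commit
that won't build.  The number of builds needed is roughly log2 of the number
of commits between the two releases.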



On Wed, Aug 12, 2020, 11:53 AM Phil Stracchino <ph...@caerllewys.net> wrote:

> Summarizing what I've learned.
>
> 1.  This problem ONLY affects Bacula Director 9.6.5.  Running a 9.6.5
> Director with every other daemon 9.6.3 is sufficient to trigger the bug.
>  Running a 9.6.3 Director with everything else 9.6.5 is sufficient to
> prevent it.
>
> 2.  With Director 9.6.5, the hung-job bug occurs regardless of whether
> Galera auto-increment control is in use, even if Bacula is connecting
> directly to a single node without using HAproxy, even if the cluster is
> brought down to a single node (i.e. Galera not active).  With Director
> 9.6.3, the bug does NOT occur even when Bacula DB connections are
> load-balanced across the cluster using HAproxy, even if auto-increment
> control is off.
>
> 3.  The bug occurs with 9.6.5, and not with 9.6.3, using the exact same
> *build* of MariaDB.
>
>
> To me, this strongly points to a bug introduced into the Director or the
> MySQL driver between 9.6.3 and 9.6.5.  However, I have looked at the
> code changes from 9.6.3 to 9.6.5 and have not been able to spot a
> "smoking gun".  I don't see a single change that looks to me as though
> it would have an impact.
>
>
> I ran for a couple of weeks with dird 9.6.5 with every possible measure
> to prevent DB conflicts and still got hung jobs.  Then I rolled ONLY the
> Director back to 9.6.3 again and rolled back the DB connection config to
> what I'd been using for the past [mumble] years, and have not seen a
> single hung job in the last 12 days.
>
> I'm getting nowhere trying to narrow this down any further.  If anyone
> who knows and understands the codebase better than I do would care to
> look at the code diffs from 9.6.3 to 9.6.5 in the context of the stack
> traces I've been able to provide, it would be much appreciated.  I've
> done all I can to troubleshoot this and I am unable to point a finger to
> any particular piece of code, but the evidence that this is a regression
> in Director 9.6.5 looks pretty damning to me.
>
>
>
> --
>   Phil Stracchino
>   Babylon Communications
>   ph...@caerllewys.net
>   p...@co.ordinate.org
>   Landline: +1.603.293.8485
>   Mobile:   +1.603.998.6958
>
>
> _______________________________________________
> Bacula-devel mailing list
> Bacula-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-devel
>