Since you have a known-good release and a known-bad release, you could try git bisect to identify (or at least narrow down) the commit that introduced the problem. I can't remember which system you said the Director runs on, but are you compiling the Director yourself, or at least able to? How many days of running do you think it takes to say definitively whether the bug is present or absent in a given configuration? Also, do your distro's patches (if any) differ between 9.6.3 and 9.6.5?
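To give a feel for why bisecting can pay off even when each test takes days, here is a rough Python sketch of the binary search that git bisect performs. It is only an illustration, not git's actual implementation: the commit list, the position of the bad commit, and the days-per-test figure are all made-up placeholders, and it assumes the bug stays present in every commit after the one that introduced it.

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good, commits[-1] is known bad, and the
    bug persists in every commit after the one that introduced it.
    Returns (first_bad_commit, number_of_builds_tested).
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(commits[mid]):
            bad = mid   # bug already present: culprit is at or before mid
        else:
            good = mid  # bug absent: culprit is after mid
    return commits[bad], tested


# Hypothetical history: 100 commits, bug introduced at commit 37.
history = list(range(100))
culprit, builds = bisect_first_bad(history, lambda c: c >= 37)

# Placeholder soak time per build; substitute whatever you consider conclusive.
days_per_test = 12
print(f"first bad commit: {culprit}, builds tested: {builds}, "
      f"worst-case days: {builds * days_per_test}")
```

The number of builds grows with the logarithm of the number of commits, so the total time is dominated by how long each build has to run before you trust a "good" verdict. That is why the days-per-test question matters.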
On Wed, Aug 12, 2020, 11:53 AM Phil Stracchino <ph...@caerllewys.net> wrote:

> Summarizing what I've learned.
>
> 1. This problem ONLY affects Bacula Director 9.6.5. Running a 9.6.5
> Director with every other daemon 9.6.3 is sufficient to trigger the bug.
> Running a 9.6.3 Director with everything else 9.6.5 is sufficient to
> prevent it.
>
> 2. With Director 9.6.5, the hung-job bug occurs regardless of whether
> Galera auto-increment control is in use, even if Bacula is connecting
> directly to a single node without using HAproxy, even if the cluster is
> brought down to a single node (i.e. Galera not active). With Director
> 9.6.3, the bug does NOT occur even when Bacula DB connections are
> load-balanced across the cluster using HAproxy, even if auto-increment
> control is off.
>
> 3. The bug occurs with 9.6.5, and not with 9.6.3, using the exact same
> *build* of MariaDB.
>
> To me, this strongly points to a bug introduced into the Director or the
> MySQL driver between 9.6.3 and 9.6.5. However, I have looked at the
> code changes from 9.6.3 to 9.6.5, but have not been able to spot a
> "smoking gun". I don't see a single change that looks to me as though
> it would have an impact.
>
> I ran for a couple of weeks with dird 9.6.5 with every possible measure
> to prevent DB conflicts and still got hung jobs. Then I rolled ONLY the
> Director back to 9.6.3 again and rolled back the DB connection config to
> what I'd been using for the past [mumble] years, and have not seen a
> single hung job in the last 12 days.
>
> I'm getting nowhere trying to narrow this down any further. If anyone
> who knows and understands the codebase better than I do would care to
> look at the code diffs from 9.6.3 to 9.6.5 in the context of the stack
> traces I've been able to provide, it would be much appreciated. I've
> done all I can to troubleshoot this and I am unable to point a finger to
> any particular piece of code, but the evidence that this is a regression
> in Director 9.6.5 looks pretty damning to me.
>
> --
> Phil Stracchino
> Babylon Communications
> ph...@caerllewys.net
> p...@co.ordinate.org
> Landline: +1.603.293.8485
> Mobile: +1.603.998.6958
>
> _______________________________________________
> Bacula-devel mailing list
> Bacula-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-devel