On 2020-07-14 12:18, Phil Stracchino wrote:
> At which I sheepishly notice that I have wsrep_auto_increment_control
> OFF when I'd have sworn it was on.  Which doesn't change the fact that
> it works that way with 9.6.3.
> 
> Nevertheless, I'll retest 9.6.5 with that change just to be certain.  If
> this solves the problem, it's an easy requirement.


Well, in testing so far there APPEAR to be no further hung jobs.
However, I'm not confident I can declare the problem solved yet, because
it was also occurring (at a lower rate) *without* HAproxy, connecting
directly to the local DB node.

However, I think I'm now ready to pretty much completely describe the
bug.  In this case, the misconfiguration was fortunate because it
exposed the bug.



1.  The triggering condition is a DB record insertion that fails for any
reason, *including recoverable failures* such as InnoDB rollbacks.
(MySQL uses rollbacks to notify the application of any of several types
of transient error, including deadlocks and lock wait timeouts during a
transaction.  Galera *additionally* uses InnoDB rollbacks to notify the
application of a local commit conflict between nodes.)

2.  The correct action on receiving a rollback from a MySQL-compatible
DB, either standalone *or* a Galera cluster, is to make at least one
attempt to resubmit the transaction.  Instead, Bacula's MySQL driver
treats all errors as fatal and immediately aborts the entire job without
retrying the insert.

3.  When the job is aborted because of a DB insertion error, the job is
marked as having a fatal error, but is not properly terminated.  It
hangs indefinitely, potentially blocking other jobs that have lower
priority or are waiting for resources the hung job is holding.

4.  The fatal error status is correctly reported in bconsole by 'status
dir', but the jobs list in BAT shows the job as still running because it
has not terminated.

5.  The failed job CAN be cancelled, but the cancellation takes a long
time to complete, and if a second job is cancelled before the first
cancellation has finished, there is an extremely high likelihood that
the Director will crash.

6.  The bug manifests only in 9.6.5.
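
To illustrate the retry behaviour described in point 2, here is a
minimal sketch in Python.  It is NOT Bacula's driver code: DbError,
run_with_retry, and the max_attempts/backoff parameters are hypothetical
names invented for the illustration.  The only things taken from MySQL
itself are the two server error numbers, 1205 (lock wait timeout) and
1213 (deadlock); as far as I know, Galera reports a commit conflict to
the client with the same deadlock error.

```python
import time

# MySQL server error numbers that indicate a transient, retryable failure.
ER_LOCK_WAIT_TIMEOUT = 1205
ER_LOCK_DEADLOCK = 1213
RETRYABLE = {ER_LOCK_WAIT_TIMEOUT, ER_LOCK_DEADLOCK}


class DbError(Exception):
    """Hypothetical driver exception carrying the MySQL error number."""

    def __init__(self, errno, msg=""):
        super().__init__(f"{errno}: {msg}")
        self.errno = errno


def run_with_retry(txn, max_attempts=3, backoff=0.0):
    """Call txn() and resubmit it if it fails with a retryable error.

    Assumption: txn() runs one complete transaction (begin, statements,
    commit) and rolls back before raising, so calling it again is safe.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DbError as err:
            # Non-transient errors, or running out of attempts, stay fatal.
            if err.errno not in RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff
```

The point of the sketch is only the classification: a deadlock or lock
wait timeout gets resubmitted a bounded number of times, while anything
else (a duplicate key, say) is still reported as a fatal error on the
first occurrence.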



Mantis at bugs.bacula.org appears not to offer any version later than
9.2.1 for bug reports...?


-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
