On 2020-07-14 12:18, Phil Stracchino wrote:
> At which I sheepishly notice that I have wsrep_auto_increment_control
> OFF when I'd have sworn it was on. Which doesn't change the fact that
> it works that way with 9.6.3.
>
> Nevertheless, I'll retest 9.6.5 with that change just to be certain. If
> this solves the problem, it's an easy requirement.

Well, testing so far APPEARS to show no further hung jobs. I'm not confident I can declare that for sure yet, though, because it was also occurring (at a lower rate) *without* HAproxy, connecting directly to the local DB node.

However, I think I'm now ready to pretty much completely describe the bug. In this case, the misconfiguration was fortunate because it exposed the bug.

1. The triggering condition is when a DB record insertion fails for any reason, *including recoverable failures* such as InnoDB rollbacks. (MySQL uses rollbacks to notify the application of any of several types of transient error, including deadlocks or lock wait timeout exceeded during a transaction. Galera *additionally* uses InnoDB rollbacks to notify the application of a local commit conflict between nodes.)

2. The correct action on receiving a rollback from a MySQL-compatible DB, either standalone *or* a Galera cluster, is to make at least one attempt to resubmit the transaction. Instead, Bacula's MySQL driver is regarding all errors as fatal and immediately aborting the entire job without retrying the insert.

3. When the job is aborted because of a DB insertion error, the job is marked as having a fatal error, but is not properly terminated, and hangs indefinitely, potentially blocking other jobs that have lower priorities or are waiting for the resources the hung job is using.

4. The fatal error status is correctly reported in bconsole by 'status dir', but the jobs list in BAT shows the job as still running because it has not terminated.

5. The failed job CAN be cancelled, but completing the termination of the job takes a long time, and if a second job is cancelled before the first cancellation has completed, there is an extremely high likelihood that the Director will crash.

6. The bug manifests only in 9.6.5.

Mantis at bugs.bacula.org appears not to offer any version later than 9.2.1 for bug reports...?

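To illustrate the behaviour point 2 above asks for, here is a minimal sketch of a retry wrapper. It is not Bacula's code and not a proposed patch: `DBError` and `run_with_retry` are hypothetical names invented for the example, and the retry count and backoff are arbitrary. The only real-world values are the MySQL error numbers 1213 (ER_LOCK_DEADLOCK) and 1205 (ER_LOCK_WAIT_TIMEOUT); as far as I know, Galera reports a commit conflict to the client as 1213 as well, but treat that as an assumption.

```python
import time

# MySQL error numbers for the transient failures named in point 1.
ER_LOCK_WAIT_TIMEOUT = 1205
ER_LOCK_DEADLOCK = 1213
RETRYABLE_ERRNOS = {ER_LOCK_WAIT_TIMEOUT, ER_LOCK_DEADLOCK}


class DBError(Exception):
    """Hypothetical stand-in for a driver exception carrying a MySQL errno."""

    def __init__(self, errno, msg=""):
        super().__init__(f"{errno}: {msg}")
        self.errno = errno


def run_with_retry(txn, max_attempts=3, delay=0.0):
    """Call txn() and resubmit it when it fails with a retryable errno.

    Non-retryable errors, and the final retryable failure, propagate to
    the caller, which can then treat them as fatal.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DBError as exc:
            if exc.errno not in RETRYABLE_ERRNOS or attempt == max_attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff (assumption)
```

The point of the sketch is only the distinction between the two classes of error: a deadlock or lock wait timeout gets the transaction resubmitted, while anything else (a duplicate key, say) still fails the job on the first attempt.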
-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel