On 2020-07-13 13:59, Martin Simmons wrote:
>>>>>> On Sun, 12 Jul 2020 14:32:44 -0400, Phil Stracchino said:
>>
>> On 2020-07-12 14:12, Phil Stracchino wrote:
>>> To test this theory I have built a 9.6.5 director with LZO support
>>> disabled and am testing it now.
>>
>> Well, that didn't work.
>>
>> But this does definitely now seem to be related to lzo/lz4 decompression
>> failures on the 9.6.3/9.6.5 SD that were not happening with a 9.6.3
>> Director. So that's narrowed it down quite a bit.
>
> I think Bacula's lzo support is only used by bacula-fd and bextract.
>
> This lz4 stuff is probably controlled by the Comm Compression directive.

I'm not using that directive anywhere. I've found the documentation on
it and could try turning it off, but with more thought and Radosław's
comment, my gut feeling is that it's smoke and noise that's not related
to the actual problem, since it's never been an issue before now. And
it stands to reason that if the stream from the client to the SD stalls
because of a fatal error in the job, then decompression of the stalled
stream is naturally going to choke and trigger the watchdog timeout.

I think the heart of the problem is somewhere around sql_create.c:968,
sql_insert_autokey_record(). But in my reading of the code I didn't
find anything obviously relevant that had changed between 9.6.3 and
9.6.5, and it's working perfectly in 9.6.3.

About all I do seem to be able to say for sure here is that the problem:

- is something to do with how the Director is talking to the database
  that becomes an issue when using a Galera cluster;
- is possibly related to how Bacula handles (or does not handle)
  rollbacks from the database (i.e., on receiving a rollback it appears
  to abort instead of retrying as it should);
- does not occur in any Director version before 9.6.5; and
- does not require that any Bacula daemon other than the Director be
  updated from 9.6.3 to 9.6.5. The Director alone is both necessary
  and sufficient.

It seems likely, based upon my understanding of MySQL and of Galera
clusters, that the underlying root cause of the problem may be a race
condition between threads attempting to create two rows with the same
id.

-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile: +1.603.998.6958

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
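
The retry-on-rollback behaviour described in the message above (retrying
a transaction when the database rolls it back, rather than aborting the
job) can be sketched as follows. This is only an illustration in Python,
not Bacula's code: DbError, run_with_retry and FlakyInsert are
hypothetical names, and the sketch assumes the rollback reaches the
client as MySQL error 1213 (ER_LOCK_DEADLOCK), which is how a Galera
certification conflict is normally reported.

```python
import time

# MySQL error code for "Deadlock found when trying to get lock;
# try restarting transaction". Assumed here to be what the client
# sees when the cluster rolls the transaction back.
ER_LOCK_DEADLOCK = 1213


class DbError(Exception):
    """Hypothetical stand-in for a database driver error with an errno."""

    def __init__(self, errno, msg=""):
        super().__init__(msg)
        self.errno = errno


def run_with_retry(txn, max_attempts=5, base_delay=0.0):
    """Call txn(); retry it if it fails with a deadlock/rollback error.

    Any other error, or a deadlock on the final attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DbError as exc:
            if exc.errno != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            # Linear back-off before the next attempt.
            time.sleep(base_delay * attempt)


class FlakyInsert:
    """Hypothetical insert that is rolled back `failures` times, then succeeds."""

    def __init__(self, failures, new_id=42):
        self.failures = failures
        self.new_id = new_id
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise DbError(ER_LOCK_DEADLOCK, "Deadlock found when trying to get lock")
        return self.new_id  # pretend this is the auto-generated key


if __name__ == "__main__":
    insert = FlakyInsert(failures=2)
    print(run_with_retry(insert), insert.calls)
```

In this sketch an insert that is rolled back twice still succeeds on the
third attempt, whereas code that treats the first rollback as fatal would
fail the job at that point. Whether the 9.6.5 Director actually behaves
that way is the open question in this thread, not something the sketch
demonstrates.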