Re: [Bacula-devel] Hung jobs: continued diagnosis

Martin Simmons Tue, 14 Jul 2020 08:31:21 -0700

>>>>> On Mon, 13 Jul 2020 15:26:17 -0400, Phil Stracchino said:
> 
> On 2020-07-13 13:59, Martin Simmons wrote:
> >>>>>> On Sun, 12 Jul 2020 14:32:44 -0400, Phil Stracchino said:
> >>
> >> On 2020-07-12 14:12, Phil Stracchino wrote:
> >>> To test this theory I have built a 9.6.5 director with LZO support
> >>> disabled and am testing it now.
> >>
> >> Well, that didn't work.
> >>
> >> But this does definitely now seem to be related to lzo/lz4 decompression
> >> failures on the 9.6.3/9.6.5 SD that were not happening with a 9.6.3
> >> Director.  So that's narrowed it down quite a bit.
> > 
> > I think Bacula's lzo support is only used by bacula-fd and bextract.
> > 
> > This lz4 stuff is probably controlled by the Comm Compression directive.
> 
> 
> I'm not using that directive anywhere.  I've found the documentation on
> it and could try turning it off, but with more thought and Radosław's
> comment, my gut feeling is that it's smoke and noise that's not related
> to the actual problem, since it's never been an issue before now.  And
> it stands to reason that if the stream from the client to the SD stalls
> because of a fatal error in the job, then decompression of the stalled
> stream is naturally going to choke and trigger the watchdog timeout.
> 
> 
> I think the heart of the problem is somewhere around sql_create.c:968,
> sql_insert_autokey_record().  But in my reading of the code I didn't
> find anything obviously relevant that had changed between 9.6.3 and
> 9.6.5, and it's working perfectly in 9.6.3.


Sorry if you've already mentioned it, but is the 9.6.3 Director the same old
binary as you used in the past?  Or have you recompiled it recently?  If it is
the old binary, maybe something else has changed that affects compilation, so
you could try recompiling 9.6.3 to check that still works.


> About all I do seem to be able to say for sure here is that the problem
> is something to do with how the Director is talking to the database that
> becomes an issue when using a Galera cluster, is possibly related to how
> Bacula handles (or does not handle) rollbacks from the database (i.e., in
> case of receiving a rollback it appears to abort instead of retrying as
> it should), does not occur in any Director version before 9.6.5, and
> does not require that any Bacula daemon other than the Director be
> updated from 9.6.3 to 9.6.5.  The Director alone is both necessary and
> sufficient.
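
For reference, the "retry instead of abort" behaviour described above can be
sketched as follows. This is only an illustration under stated assumptions,
not Bacula's code: run_with_retry, txn_fn and fake_insert are hypothetical
names, and fake_insert merely simulates a transaction that is rolled back a
few times before it succeeds. The one real detail is error code 1213
(ER_LOCK_DEADLOCK), which is what a MySQL/MariaDB client sees when a
transaction is rolled back as a deadlock victim; Galera reports certification
conflicts the same way.

```c
/* MySQL/MariaDB error code for "deadlock found; try restarting transaction". */
#define ER_LOCK_DEADLOCK 1213

/* Hypothetical callback: runs one whole transaction, returns 0 on success
 * or a MySQL-style error code on failure. */
typedef int (*txn_fn)(void *ctx);

/* Re-run the transaction while it fails with a deadlock/rollback error,
 * up to max_attempts times. Any other result is returned immediately. */
int run_with_retry(txn_fn fn, void *ctx, int max_attempts)
{
    int rc = ER_LOCK_DEADLOCK;
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        rc = fn(ctx);
        if (rc != ER_LOCK_DEADLOCK)
            return rc;          /* success, or an error retrying won't fix */
    }
    return rc;                  /* still conflicting after max_attempts */
}

/* Stand-in for a real insert: ctx points to an int holding the number of
 * simulated rollbacks left before the insert goes through. */
int fake_insert(void *ctx)
{
    int *rollbacks_left = ctx;
    if (*rollbacks_left > 0) {
        (*rollbacks_left)--;
        return ER_LOCK_DEADLOCK;
    }
    return 0;
}
```

With two simulated rollbacks and a limit of five attempts, run_with_retry
returns 0 on the third attempt; with more rollbacks than attempts it returns
the deadlock code to the caller, which can then fail the job.
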
> 
> It seems likely, based upon my understanding of MySQL and of Galera
> clusters, that the underlying root cause of the problem may be a race
> condition between threads attempting to create two rows with the same id.

Yes, but according to https://mariadb.org/auto-increments-in-galera/ this is
not supposed to happen in Galera.
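
A minimal sketch of why, assuming the usual Galera setup in which
wsrep_auto_increment_control sets auto_increment_increment to the cluster
size and gives each node its own auto_increment_offset. The function names
are made up for illustration, and the arithmetic assumes each offset is
between 1 and the increment:

```c
/* k-th (0-based) auto-increment value a node hands out, given its
 * auto_increment_offset and the cluster-wide auto_increment_increment. */
unsigned long nth_id(unsigned long offset, unsigned long increment,
                     unsigned long k)
{
    return offset + k * increment;
}

/* Two nodes can generate the same id only if their offsets are congruent
 * modulo the increment. Returns 1 if a collision is possible, else 0. */
int ids_can_collide(unsigned long offset_a, unsigned long offset_b,
                    unsigned long increment)
{
    return (offset_a % increment) == (offset_b % increment);
}
```

In a three-node cluster with offsets 1, 2 and 3, node 1 generates 1, 4, 7, ...
and node 2 generates 2, 5, 8, ..., so the sequences never overlap. A collision
would need two nodes sharing an offset.
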

__Martin


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
