Re: [Bacula-devel] Hung jobs: continued diagnosis

Phil Stracchino Mon, 13 Jul 2020 12:28:12 -0700

On 2020-07-13 13:59, Martin Simmons wrote:
>>>>>> On Sun, 12 Jul 2020 14:32:44 -0400, Phil Stracchino said:
>>
>> On 2020-07-12 14:12, Phil Stracchino wrote:
>>> To test this theory I have built a 9.6.5 director with LZO support
>>> disabled and am testing it now.
>>
>> Well, that didn't work.
>>
>> But this does definitely now seem to be related to lzo/lz4 decompression
>> failures on the 9.6.3/9.6.5 SD that were not happening with a 9.6.3
>> Director.  So that's narrowed it down quite a bit.
> 
> I think Bacula's lzo support is only used by bacula-fd and bextract.
> 
> This lz4 stuff is probably controlled by the Comm Compression directive.



I'm not using that directive anywhere.  I've found the documentation on
it and could try turning it off, but with more thought and Radosław's
comment, my gut feeling is that it's smoke and noise that's not related
to the actual problem, since it's never been an issue before now.  And
it stands to reason that if the stream from the client to the SD stalls
because of a fatal error in the job, then decompression of the stalled
stream is naturally going to choke and trigger the watchdog timeout.


I think the heart of the problem is somewhere around sql_create.c:968,
sql_insert_autokey_record().  But in my reading of the code I didn't
find anything obviously relevant that had changed between 9.6.3 and
9.6.5, and it's working perfectly in 9.6.3.

About all I can say for sure is that the problem has something to do
with how the Director talks to the database, and that it only becomes an
issue when using a Galera cluster.  It is possibly related to how Bacula
handles (or does not handle) rollbacks from the database (i.e., on
receiving a rollback it appears to abort instead of retrying as it
should).  It does not occur in any Director version before 9.6.5, and it
does not require that any Bacula daemon other than the Director be
updated from 9.6.3 to 9.6.5.  Updating the Director alone is both
necessary and sufficient.
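
To make "retrying as it should" concrete, here is a minimal Python
sketch of the behaviour I would expect from a client of a Galera
cluster.  It is not Bacula code.  DatabaseError, run_with_retry and
make_flaky_insert are hypothetical names invented for illustration, and
the only real-world detail assumed is that a transaction rolled back by
a Galera certification conflict reaches the client as MySQL error 1213
(ER_LOCK_DEADLOCK).

```python
import time

# MySQL error code for "Deadlock found when trying to get lock; try
# restarting transaction". Assumption: this is how a Galera
# certification conflict is reported to the client.
ER_LOCK_DEADLOCK = 1213


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver exception carrying a MySQL error code."""

    def __init__(self, errno, message=""):
        super().__init__(message)
        self.errno = errno


def run_with_retry(transaction, max_attempts=5, base_delay=0.0):
    """Call transaction(), retrying if the server rolled it back with error 1213.

    Any other error, or a deadlock on the final attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DatabaseError as exc:
            if exc.errno != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            # Linear backoff before the whole transaction is replayed.
            time.sleep(base_delay * attempt)


def make_flaky_insert(failures):
    """Build a fake insert that is rolled back `failures` times, then succeeds.

    The returned function reports how many times it has been called.
    """
    state = {"calls": 0}

    def insert():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise DatabaseError(ER_LOCK_DEADLOCK, "Deadlock found when trying to get lock")
        return state["calls"]

    return insert
```

With this sketch, run_with_retry(make_flaky_insert(2)) succeeds on the
third call, whereas aborting on the first rollback would fail the whole
job, which matches the symptom described above.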

It seems likely, based upon my understanding of MySQL and of Galera
clusters, that the underlying root cause of the problem may be a race
condition between threads attempting to create two rows with the same id.


-- 
  Phil Stracchino
  Babylon Communications
  [email protected]
  [email protected]
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958


_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
