On 2020-06-22 13:08, Martin Simmons wrote:
>>>>>> On Sun, 21 Jun 2020 09:08:49 -0400, Phil Stracchino said:
>>
>> On 2020-06-20 14:33, Phil Stracchino wrote:
>>> OK, two days with zero hung jobs.  I am proceeding with re-upgrading
>>> ONLY the Director (well, and that host's FD) to 9.6.5.
>>
>>
>> That got me three successful jobs, one failed, and two hung.  Here's the
>> failure:
>>
>>
>> 21-Jun 04:30 minbar-dir JobId 25026: Fatal error: sql_create.c:968
>>   Create db File record INSERT INTO File
>>   (FileIndex,JobId,PathId,FilenameId,LStat,MD5,DeltaSeq) VALUES
>>   (2,25026,122083,109,'R0AAQAM E EHt H A A -B H IA F Be7wj5 BciQ9i BciQ9i A A C','0',0)
>>   failed. ERR=Deadlock found when trying to get lock; try restarting
>>   transaction
>> 21-Jun 04:30 minbar-dir JobId 25026: Fatal error: catreq.c:513
>>   Attribute create error: ERR=sql_create.c:968 Create db File record
>>   INSERT INTO File
>>   (FileIndex,JobId,PathId,FilenameId,LStat,MD5,DeltaSeq) VALUES
>>   (2,25026,122083,109,'R0AAQAM E EHt H A A -B H IA F Be7wj5 BciQ9i BciQ9i A A C','0',0)
>>   failed. ERR=Deadlock found when trying to get lock; try restarting
>>   transaction
>> 21-Jun 04:30 asgard-fd JobId 25026: Error: bsock.c:383 Write error
>>   sending 79 bytes to Storage daemon:asgard.caerllewys.net:9103:
>>   ERR=Broken pipe
> 
> BTW, did you cancel the job in this case or did it crash by itself?

That job "cleanly" failed without causing any additional problems other
than forcing me to manually re-run it.


>> Once again it is somehow creating a local commit conflict on the
>> cluster.  I CAN configure Bacula to send all transactions to a single
>> node of the cluster instead of load-balancing them; however, Bacula
>> SHOULD be detecting reported deadlocks and retrying on its own.
> 
> I don't think Bacula has any code that retries after deadlocks, mainly
> because it doesn't expect them (nothing else should be writing to the 
> database and it takes care not to create deadlocks between threads).

Then if for any reason it SHOULD encounter a deadlock, it will fail.  As
it did here.  Nothing else *was* writing to the Bacula database, except
other Director threads running other jobs.  There would have been six
jobs from five clients running at the time.  It appears a race condition
occurred.

Any non-trivial application that uses MySQL as a back-end should be
prepared for the possibility of a deadlock and check the return value
from DB calls.  This is one of those standard best practices with any
DB:  Don't assume, *verify* that your transaction was correctly applied.
An application that does not could fail to detect that a write was not
applied, resulting in inconsistent data.  (I have worked with customer
applications that "fake" it by doing an immediate read-back and
comparing to what they wrote, but that is wasteful and inefficient.)
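
To illustrate the kind of check-and-retry handling meant here, the
following is a minimal sketch in Python.  It is not Bacula code and not
how Bacula talks to its catalog.  DbError and run_with_deadlock_retry
are hypothetical names invented for the illustration, and the backoff
numbers are arbitrary assumptions.  The only MySQL-specific fact it
relies on is that error 1213 is the "Deadlock found when trying to get
lock" error seen in the log above.

```python
import random
import time

# MySQL error number for "Deadlock found when trying to get lock".
ER_LOCK_DEADLOCK = 1213


class DbError(Exception):
    """Hypothetical stand-in for a driver exception carrying a MySQL errno."""

    def __init__(self, errno, msg):
        super().__init__(msg)
        self.errno = errno


def run_with_deadlock_retry(txn, max_attempts=5, base_delay=0.05,
                            sleep=time.sleep):
    """Call txn(); if it fails with a deadlock, back off and run it again.

    txn is assumed to execute one complete transaction (begin, statements,
    commit) so that re-running it from the top is safe.  Any error other
    than a deadlock, or a deadlock on the final attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DbError as exc:
            if exc.errno != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            # Exponential backoff with jitter so that competing threads
            # do not all retry at the same instant.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_insert():
        # Simulated transaction: deadlocks twice, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise DbError(ER_LOCK_DEADLOCK, "Deadlock found when trying "
                          "to get lock; try restarting transaction")
        return "committed"

    print(run_with_deadlock_retry(flaky_insert), "after",
          attempts["n"], "attempts")
```

The point of the sketch is only that the deadlock error is recognized
and the whole transaction is restarted, rather than the job failing on
the first occurrence.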


-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel