On 2020-06-22 13:08, Martin Simmons wrote:
>>>>>> On Sun, 21 Jun 2020 09:08:49 -0400, Phil Stracchino said:
>>
>> On 2020-06-20 14:33, Phil Stracchino wrote:
>>> OK, two days with zero hung jobs.  I am proceeding with re-upgrading
>>> ONLY the Director (well, and that host's FD) to 9.6.5.
>>
>>
>> That got me three successful jobs, one failed, and two hung.  Here's the
>> failure:
>>
>>
>> 21-Jun 04:30 minbar-dir JobId 25026: Fatal error: sql_create.c:968 Create db File
>> +record INSERT INTO File (FileIndex,JobId,PathId,FilenameId,LStat,MD5,DeltaSeq)
>> +VALUES (2,25026,122083,109,'R0AAQAM E EHt H A A -B H IA F Be7wj5 BciQ9i BciQ9i A A
>> +C','0',0) failed. ERR=Deadlock found when trying to get lock; try restarting
>> +transaction
>> 21-Jun 04:30 minbar-dir JobId 25026: Fatal error: catreq.c:513 Attribute
>> +create error: ERR=sql_create.c:968 Create db File record INSERT INTO File
>> +(FileIndex,JobId,PathId,FilenameId,LStat,MD5,DeltaSeq) VALUES
>> +(2,25026,122083,109,'R0AAQAM E EHt H A A -B H IA F Be7wj5 BciQ9i BciQ9i A A
>> +C','0',0) failed. ERR=Deadlock found when trying to get lock; try restarting
>> +transaction
>> 21-Jun 04:30 asgard-fd JobId 25026: Error: bsock.c:383 Write error
>> +sending 79 bytes to Storage daemon:asgard.caerllewys.net:9103: ERR=Broken pipe
>
> BTW, did you cancel the job in this case or did it crash by itself?

That job "cleanly" failed without causing any additional problems other
than forcing me to manually re-run it.

>> Once again it is somehow creating a local commit conflict on the
>> cluster.  I CAN configure Bacula to send all transactions to a single
>> node of the cluster instead of load-balancing them; however, Bacula
>> SHOULD be detecting reported deadlocks and retrying on its own.
>
> I don't think Bacula has any code that retries after deadlocks, mainly
> because it doesn't expect them (nothing else should be writing to the
> database and it takes care not to create deadlocks between threads).

Then if for any reason it SHOULD encounter a deadlock, it will fail.  As
it did here.

Nothing else *was* writing to the Bacula database, except other Director
threads running other jobs.  There would have been six jobs from five
clients running at the time.  It appears a race condition occurred.

Any non-trivial application that uses MySQL as a back-end should be
prepared for the possibility of a deadlock and should check the return
value from DB calls.  This is one of those standard best practices with
any DB:  Don't assume, *verify* that your transaction was correctly
applied.  An application that does not do so could fail to detect that a
write was not applied, resulting in inconsistent data.

(I have worked with customer applications that "fake" it by doing an
immediate read-back and comparing to what they wrote, but that is
wasteful and inefficient.)

-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
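
For illustration only, the "check the return value and retry on deadlock" practice argued for above might look roughly like the sketch below. This is not Bacula code (Bacula is written in C) and not any real driver's API: `DBError`, `run_with_retry`, and the `sleep` parameter are hypothetical names invented for the example. The one hard fact relied on is that MySQL reports "Deadlock found when trying to get lock" as error 1213 (ER_LOCK_DEADLOCK) and rolls the whole transaction back, so the entire transaction has to be re-run, not just the failed statement.

```python
import random
import time

# MySQL server error code for "Deadlock found when trying to get lock;
# try restarting transaction".
ER_LOCK_DEADLOCK = 1213


class DBError(Exception):
    """Hypothetical stand-in for a driver exception carrying the server error code."""

    def __init__(self, errno, msg):
        super().__init__(f"{errno}: {msg}")
        self.errno = errno


def run_with_retry(txn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run txn() and re-run it if the server reports a deadlock.

    txn must perform one complete transaction (begin, statements, commit),
    because a deadlock rolls back everything done so far. Any other error,
    or a deadlock on the final attempt, is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DBError as exc:
            if exc.errno != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            # Exponential backoff with jitter so that competing writers
            # do not all retry at the same instant.
            sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
```

A caller would wrap each unit of work, for example the File-record insert for one job, in a function and pass it to `run_with_retry`. Only if the retries are exhausted would the job need to fail the way JobId 25026 did.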