Thanks Kern! I'll bring in a DBA on our side to have a look. Would you have any thoughts on this question posed earlier?
3. Why is Bacula spinning off a new job right away after it detects the deadlock for each affected job instead of waiting until the rescheduled job runs? I verified that there were no duplicate jobs in the queue before the backups started running, no jobs were running before the start of the backups, and I did not start any of these backups manually to cause a second job to appear.

This happened on both nights I ran with Accurate turned On on the hosts that had failed backups because of the deadlock.

Regards,
-craig

On Thu, Aug 6, 2015 at 9:48 AM, Kern Sibbald <k...@sibbald.com> wrote:

> On 06.08.2015 21:44, Craig Shiroma wrote:
>
> Hi Kern,
>
> Thank you for the info! We're using MySQL 5.6 Percona Server, Release
> 68.0, Revision 656.
>
> Would this setting cause the problem?
> innodb_lock_wait_timeout = 100
>
> Is it too high or too low or has no bearing on the problem?
>
>
> Sorry, I am a Bacula programmer, and I do not know much about databases --
> especially MySQL since I use PostgreSQL. PostgreSQL is harder to install
> and a bit harder to configure than MySQL, but it performs much better.
>
>
> Thanks again,
> -craig
>
> On Thu, Aug 6, 2015 at 9:26 AM, Kern Sibbald <k...@sibbald.com> wrote:
>
>> On 06.08.2015 18:46, Bryn Hughes wrote:
>>
>> I think what Kern is getting at is that your database is what threw the
>> error, not Bacula. Whatever DB you are using is what is having the issue.
>>
>>
>> Yes. That is exactly what I was implying.
>>
>> The rest of this is directed to Craig:
>> If you are using MariaDB (I have no indication that you are), please be
>> aware that it may be a very good database, maybe even better than MySQL,
>> but Bacula is built and tested against MySQL, and if you use binaries that
>> were built for MySQL, you could run into problems by using MariaDB. Even
>> if your binaries were explicitly built with MariaDB, it may not be
>> compatible with the way Bacula works.
>> Bacula has a tendency to push
>> databases to the extreme, and it works well with MySQL and PostgreSQL, but
>> possibly not with other databases. I bring up MariaDB because it has been
>> mentioned in another posting to this list.
>>
>> I would be very surprised if your problem has anything to do with
>> Accurate -- the database routines know nothing about accurate and none of
>> the data is different. It is more likely due to the VM environment or to
>> some build or version problem with MySQL (or MariaDB).
>>
>> Best regards,
>> Kern
>>
>>
>> Bryn
>>
>> On 2015-08-06 09:11 AM, Craig Shiroma wrote:
>>
>> Hi Kern,
>>
>> Thank you very much for the reply! Would you have any suggestions on
>> what may be causing this problem or how I can debug it? Obviously, I'm
>> encountering deadlocks when accurate backup runs on some of our hosts and
>> we want to use accurate backup on all of our hosts if possible.
>>
>> Warmest regards,
>> -craig
>>
>> On Thu, Aug 6, 2015 at 12:11 AM, Kern Sibbald <k...@sibbald.com> wrote:
>>
>>> On 06.08.2015 10:15, Craig Shiroma wrote:
>>>
>>> Hello again,
>>>
>>> I just thought I'd update this post with more information in hopes of
>>> getting some explanation for the deadlocks.
>>>
>>> I ran with Accurate backup on our test VMs (RHEL) for a couple of days
>>> and got the same errors on some VMs that were running accurate and some
>>> that were not. These hosts were running concurrently. I would say 90% of
>>> the hosts that were configured to use Accurate finished successfully.
>>> However, there were a few that failed with the deadlock error -- some that
>>> were configured to use accurate and some that were not configured to use
>>> accurate. Also, on all of these, a second job started for each of the
>>> affected hosts right after Bacula detected the deadlock even though it said
>>> a reschedule would happen 3600 seconds later (the 3600 seconds is correct).
>>>
>>> Tonight, I disabled accurate on all hosts and the deadlocks did not
>>> happen. No errors were detected and all the backups finished successfully.
>>>
>>> Some questions...
>>> 1. Can I back up multiple hosts concurrently with some hosts configured
>>> to use accurate and some configured not to use accurate? Or, is it an all
>>> or none thing, meaning all hosts that run concurrently must either be using
>>> accurate backup or not using accurate backup (cannot mix the two)?
>>>
>>> 2. It seems like the hosts that get out of the starting gate first are
>>> the ones affected. I am configured to run 50 jobs concurrently. Again, no
>>> problems with accurate turned off on all hosts for months now.
>>>
>>> 3. Why is Bacula spinning off a new job right away after it detects the
>>> deadlock for each affected job instead of waiting until the rescheduled job
>>> runs? I verified that there were no duplicate jobs in the queue before the
>>> backups started running, no jobs were running before the start of the
>>> backups, and I did not start any of these backups manually to cause a
>>> second job to appear.
>>>
>>>
>>> Bacula is not aware of any SQL internal deadlocks.
>>>
>>>
>>> From the INNODB Monitor output:
>>>
>>> TRANSACTION:
>>> TRANSACTION 208788977, ACTIVE 1 sec setting auto-inc lock
>>> mysql tables in use 4, locked 4
>>> 9 lock struct(s), heap size 1184, 5 row lock(s)
>>> MySQL thread id 50808, OS thread handle 0x7f8f2c3b4700, query id
>>> 29558637 <host> 192.168.10.99 bacula Sending data
>>> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5,
>>> DeltaSeq) SELECT batch.FileIndex, batch.JobId, Path.PathId,
>>> Filename.FilenameId,batch.LStat, batch.MD5, batch.DeltaSeq FROM batch JOIN
>>> Path ON (batch.Path = Path.Path) JOIN Filename ON (batch.Name =
>>> Filename.Name)
>>> WAITING FOR THIS LOCK TO BE GRANTED:
>>> TABLE LOCK table `bacula`.`File` trx id 208788977 lock mode AUTO-INC
>>> waiting
>>> WE ROLL BACK TRANSACTION (2)
>>>
>>> I am running Bacula 7.0.5 on RHEL 6.6 x64 with Director, Storage and
>>> Catalog running on separate RHEL 6.6 hosts. Our clients are RHEL 6's, 5's
>>> and Windows Servers 2008 and 2012R2.
>>>
>>> Any help would be much appreciated.
>>>
>>> Warmest regards,
>>> -craig
>>>
>>> On Tue, Aug 4, 2015 at 1:56 PM, Craig Shiroma <shiroma.crai...@gmail.com> wrote:
>>>
>>>> BTW, I suppose there could've been two jobs for the host(s) in
>>>> scheduling queue. If this was the case, is there a way to find out after
>>>> the fact? If this did actually happen, what could cause duplicate jobs to
>>>> be scheduled on the same day at the same time? I know no one manually ran
>>>> the jobs in question. Again, this only was a problem for a few of the jobs
>>>> that ran last night, not all of them and some to do accurate backup and
>>>> some not.
>>>>
>>>> Regards,
>>>> -craig
>>>>
>>>> On Tue, Aug 4, 2015 at 9:27 AM, Craig Shiroma <shiroma.crai...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I had a few backups fail last night with the following error:
>>>>>
>>>>> 2015-08-03 18:02:46 bacula-dir JobId 123984: b INTO File (FileIndex,
>>>>> JobId, PathId, FilenameId, LStat, MD5, DeltaSeq) SELECT batch.FileIndex,
>>>>> batch.JobId, Path.PathId, Filename.FilenameId,batch.LStat, batch.MD5,
>>>>> batch.DeltaSeq FROM batch JOIN Path ON (batch.Path = Path.Path) JOIN
>>>>> Filename ON (batch.Name = Filename.Name): ERR=Deadlock found when trying
>>>>> to get lock; try restarting transaction
>>>>>
>>>>> The only thing I did yesterday was switch a bunch of backups to use
>>>>> Accurate backup and restart bacula-dir and bacula-sd after that. However,
>>>>> the above problem also occurred on some hosts that was not set to use
>>>>> Accurate backup. From the log, it seems like two jobs for this host was
>>>>> scheduled to run at 18:00 because the second job started and found a
>>>>> duplicate job (job 123984) and canceled the backup. I know there were no
>>>>> jobs running before 18:00 so 123984 was not an old job still running.
>>>>> Same with the other jobs that were canceled because of the above situation.
>>>>>
>>>>> Anyway, does anyone have an idea what would cause this, especially how
>>>>> the second job got shot into the system. After the deadlock error, Bacula
>>>>> said it would reschedule the job. However the second job started right
>>>>> after the deadlock error instead of one hour later which makes me think
>>>>> that there were two jobs for this host scheduled to run at 18:00.
>>>>>
>>>>> Thank you in advance,
>>>>> -craig
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Bacula-users mailing list
>>> Bacula-users@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/bacula-users
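
For reference, the error text quoted above ("Deadlock found when trying to get lock; try restarting transaction") is MySQL error 1213, which is a different condition from the lock wait timeout (error 1205) that innodb_lock_wait_timeout controls. The usual client-side response to 1213 is to re-run the whole transaction after a short pause. Below is a minimal Python sketch of that pattern under stated assumptions. It is not how Bacula handles the error: `DeadlockError`, `run_with_deadlock_retry` and `make_flaky_insert` are hypothetical names, and the exception class only stands in for whatever a real MySQL driver would raise.

```python
import random
import time

# MySQL reports a deadlock victim with error code 1213 (ER_LOCK_DEADLOCK).
ER_LOCK_DEADLOCK = 1213


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver exception carrying a MySQL errno."""

    def __init__(self, errno=ER_LOCK_DEADLOCK):
        super().__init__(f"MySQL error {errno}")
        self.errno = errno


def run_with_deadlock_retry(txn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call txn() and re-run it after a deadlock, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DeadlockError as exc:
            # Give up on non-deadlock errors or once the attempts are used up.
            if exc.errno != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            # Exponential backoff with jitter so that jobs which collided
            # once do not all retry at the same instant.
            sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))


def make_flaky_insert(failures):
    """Hypothetical transaction: deadlocks `failures` times, then succeeds."""
    state = {"calls": 0}

    def txn():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise DeadlockError()
        return state["calls"]  # number of the attempt that succeeded

    return txn
```

With a real driver, `txn` would wrap the begin/insert/commit sequence so that a retry replays the whole transaction and not just the failed statement. Whether a retry helps here depends on why the AUTO-INC table lock is contended in the first place, which is a question for the DBA.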