Re: [Bacula-users] Deadlock error

Craig Shiroma Thu, 06 Aug 2015 13:39:08 -0700

Thanks Kern!  I'll bring in a DBA on our side to have a look.

Would you have any thoughts on this question posed earlier?


3. Why is Bacula spinning off a new job right away after it detects the
deadlock for each affected job instead of waiting until the rescheduled job
runs?  I verified that there were no duplicate jobs in the queue before the
backups started running, no jobs were running before the start of the
backups, and I did not start any of these backups manually to cause a
second job to appear.

This happened on both nights I ran with Accurate turned on for the hosts
that had failed backups because of the deadlock.

Regards,
-craig

On Thu, Aug 6, 2015 at 9:48 AM, Kern Sibbald <k...@sibbald.com> wrote:

> On 06.08.2015 21:44, Craig Shiroma wrote:
>
> Hi Kern,
>
> Thank you for the info!  We're using MySQL 5.6 Percona Server, Release
> 68.0, Revision 656.
>
> Would this setting cause the problem?
> innodb_lock_wait_timeout = 100
>
> Is it too high or too low or has no bearing on the problem?
>
>
> Sorry, I am a Bacula programmer, and I do not know much about databases --
> especially MySQL since I use PostgreSQL.  PostgreSQL is harder to install
> and a bit harder to configure than MySQL, but it performs much better.
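
On the innodb_lock_wait_timeout question above: MySQL reports a lock wait
timeout and a deadlock as two different errors. innodb_lock_wait_timeout
bounds how long a transaction waits for a lock before failing with error 1205
("Lock wait timeout exceeded; try restarting transaction"). A detected
deadlock is reported as error 1213 ("Deadlock found when trying to get lock;
try restarting transaction"), which is the message in the job log quoted
further down. The InnoDB monitor output in this thread shows the rolled-back
transaction as "ACTIVE 1 sec", well short of 100 seconds, so the timeout value
looks unlikely to be what ended it. The following is a minimal Python sketch
for telling the two apart in a job log line; the helper name
classify_lock_error is hypothetical and not part of Bacula or MySQL.

```python
# Hypothetical helper: classify a job log line by the MySQL lock error it
# contains. The two message texts are MySQL's standard messages for
# errors 1213 and 1205; everything else here is illustrative.

def classify_lock_error(log_line):
    """Return 'deadlock', 'lock wait timeout', or 'other' for a log line."""
    text = log_line.lower()
    if "deadlock found when trying to get lock" in text:
        return "deadlock"           # MySQL error 1213 (ER_LOCK_DEADLOCK)
    if "lock wait timeout exceeded" in text:
        return "lock wait timeout"  # MySQL error 1205 (ER_LOCK_WAIT_TIMEOUT)
    return "other"


if __name__ == "__main__":
    sample = ("Filename ON (batch.Name = Filename.Name): ERR=Deadlock found "
              "when trying to get lock; try restarting transaction")
    print(classify_lock_error(sample))
```

If the lines in the failed jobs all classify as "deadlock", raising or
lowering the lock wait timeout is probably not the first thing to try.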
>
>
>
> Thanks again,
> -craig
>
>
> On Thu, Aug 6, 2015 at 9:26 AM, Kern Sibbald <k...@sibbald.com> wrote:
>
>> On 06.08.2015 18:46, Bryn Hughes wrote:
>>
>> I think what Kern is getting at is that your database is what threw the
>> error, not Bacula.  Whatever DB you are using is what is having the issue.
>>
>>
>> Yes.  That is exactly what I was implying.
>>
>> The rest of this is directed to Craig:
>> If you are using MariaDB (I have no indication that you are), please be
>> aware that it may be a very good database, maybe even better than MySQL,
>> but Bacula is built and tested against MySQL, and if you use binaries that
>> were built for MySQL, you could run into problems by using MariaDB.  Even
>> if your binaries were explicitly built with MariaDB, it may not be
>> compatible with the way Bacula works.  Bacula has a tendency to push
>> databases to the extreme, and it works well with MySQL and PostgreSQL, but
>> possibly not with other databases.  I bring up MariaDB because it has been
>> mentioned in another posting to this list.
>>
>> I would be very surprised if your problem has anything to do with
>> Accurate -- the database routines know nothing about accurate and none of
>> the data is different.  It is more likely due to the VM environment or to
>> some build or version problem with MySQL (or MariaDB).
>>
>> Best regards,
>> Kern
>>
>>
>> Bryn
>>
>> On 2015-08-06 09:11 AM, Craig Shiroma wrote:
>>
>> Hi Kern,
>>
>> Thank you very much for the reply!  Would you have any suggestions on
>> what may be causing this problem or how I can debug it?  Obviously, I'm
>> encountering deadlocks when accurate backup runs on some of our hosts and
>> we want to use accurate backup on all of our hosts if possible.
>>
>> Warmest regards,
>> -craig
>>
>> On Thu, Aug 6, 2015 at 12:11 AM, Kern Sibbald <k...@sibbald.com> wrote:
>>
>>> On 06.08.2015 10:15, Craig Shiroma wrote:
>>>
>>> Hello again,
>>>
>>> I just thought I'd update this post with more information in hopes of
>>> getting some explanation for the deadlocks.
>>>
>>> I ran with Accurate backup on our test VMs (RHEL) for a couple of days
>>> and got the same errors on some VMs that were running accurate and some
>>> that were not.  These hosts were running concurrently.  I would say 90% of
>>> the hosts that were configured to use Accurate finished successfully.
>>> However, there were a few that failed with the deadlock error -- some that
>>> were configured to use accurate and some that were not configured to use
>>> accurate.  Also, on all of these, a second job started for each of the
>>> affected hosts right after Bacula detected the deadlock even though it said
>>> a reschedule would happen 3600 seconds later (the 3600 seconds is correct).
>>>
>>> Tonight, I disabled accurate on all hosts and the deadlocks did not
>>> happen.  No errors were detected and all the backups finished successfully.
>>>
>>> Some questions...
>>> 1.  Can I back up multiple hosts concurrently with some hosts configured
>>> to use accurate and some configured not to use accurate?  Or, is it an all
>>> or none thing, meaning all hosts that run concurrently must either be using
>>> accurate backup or not using accurate backup (cannot mix the two)?
>>>
>>> 2. It seems like the hosts that get out of the starting gate first are
>>> the ones affected.  I am configured to run 50 jobs concurrently.  Again, no
>>> problems with accurate turned off on all hosts for months now.
>>>
>>> 3. Why is Bacula spinning off a new job right away after it detects the
>>> deadlock for each affected job instead of waiting until the rescheduled job
>>> runs?  I verified that there were no duplicate jobs in the queue before the
>>> backups started running, no jobs were running before the start of the
>>> backups, and I did not start any of these backups manually to cause a
>>> second job to appear.
>>>
>>>
>>> Bacula is not aware of any SQL internal deadlocks.
>>>
>>>
>>> From the INNODB Monitor output:
>>>
>>> TRANSACTION:
>>> TRANSACTION 208788977, ACTIVE 1 sec setting auto-inc lock
>>> mysql tables in use 4, locked 4
>>> 9 lock struct(s), heap size 1184, 5 row lock(s)
>>> MySQL thread id 50808, OS thread handle 0x7f8f2c3b4700, query id
>>> 29558637 <host> 192.168.10.99 bacula Sending data
>>> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5,
>>> DeltaSeq) SELECT batch.FileIndex, batch.JobId, Path.PathId,
>>> Filename.FilenameId,batch.LStat, batch.MD5, batch.DeltaSeq FROM batch JOIN
>>> Path ON (batch.Path = Path.Path) JOIN Filename ON (batch.Name =
>>> Filename.Name)
>>> WAITING FOR THIS LOCK TO BE GRANTED:
>>> TABLE LOCK table `bacula`.`File` trx id 208788977 lock mode AUTO-INC
>>> waiting
>>> WE ROLL BACK TRANSACTION (2)
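
The server's message ends with "try restarting transaction". Purely as an
illustration of what that advice means for a client application, and not as a
description of how Bacula handles the error, here is a minimal Python sketch
of a retry wrapper. DatabaseError and run_with_deadlock_retry are hypothetical
names introduced for this example; a real MySQL driver raises its own
exception type, and only the error number 1213 is taken from MySQL itself.

```python
import time

MYSQL_ER_LOCK_DEADLOCK = 1213  # "Deadlock found when trying to get lock"


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver exception carrying a MySQL errno."""

    def __init__(self, errno, message):
        super().__init__(message)
        self.errno = errno


def run_with_deadlock_retry(transaction, max_attempts=3, base_delay=0.0):
    """Call transaction() and re-run it if it fails with a deadlock error.

    transaction must perform the whole unit of work, because the server has
    already rolled back the losing transaction when the error is raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DatabaseError as exc:
            # Re-raise anything that is not a deadlock, and give up once
            # the final attempt has failed.
            if exc.errno != MYSQL_ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            # Wait a little longer before each further attempt.
            time.sleep(base_delay * attempt)
```

The wrapper retries only on the deadlock error number, so other failures
(including lock wait timeouts) still surface immediately.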
>>>
>>> I am running Bacula 7.0.5 on RHEL 6.6 x64 with Director, Storage and
>>> Catalog running on separate RHEL 6.6 hosts.  Our clients are RHEL 6 and 5
>>> and Windows Server 2008 and 2012 R2.
>>>
>>> Any help would be much appreciated.
>>>
>>> Warmest regards,
>>> -craig
>>>
>>> On Tue, Aug 4, 2015 at 1:56 PM, Craig Shiroma <shiroma.crai...@gmail.com> wrote:
>>>
>>>> BTW, I suppose there could've been two jobs for the host(s) in the
>>>> scheduling queue.  If this was the case, is there a way to find out after
>>>> the fact?  If this did actually happen, what could cause duplicate jobs to
>>>> be scheduled on the same day at the same time?  I know no one manually ran
>>>> the jobs in question.  Again, this was a problem for only a few of the jobs
>>>> that ran last night, not all of them, and some of those were set to do
>>>> Accurate backup and some were not.
>>>>
>>>> Regards,
>>>> -craig
>>>>
>>>> On Tue, Aug 4, 2015 at 9:27 AM, Craig Shiroma <
>>>> shiroma.crai...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I had a few backups fail last night with the following error:
>>>>>
>>>>> 2015-08-03 18:02:46 bacula-dir JobId 123984: b INTO File (FileIndex,
>>>>> JobId, PathId, FilenameId, LStat, MD5, DeltaSeq) SELECT batch.FileIndex,
>>>>> batch.JobId, Path.PathId, Filename.FilenameId,batch.LStat, batch.MD5,
>>>>> batch.DeltaSeq FROM batch JOIN Path ON (batch.Path = Path.Path) JOIN
>>>>> Filename ON (batch.Name = Filename.Name): ERR=Deadlock found when trying
>>>>> to get lock; try restarting transaction
>>>>>
>>>>> The only thing I did yesterday was switch a bunch of backups to use
>>>>> Accurate backup and restart bacula-dir and bacula-sd after that.  However,
>>>>> the above problem also occurred on some hosts that were not set to use
>>>>> Accurate backup.  From the log, it seems like two jobs for this host were
>>>>> scheduled to run at 18:00 because the second job started and found a
>>>>> duplicate job (job 123984) and canceled the backup.  I know there were no
>>>>> jobs running before 18:00 so 123984 was not an old job still running.
>>>>> The same goes for the other jobs that were canceled because of the above
>>>>> situation.
>>>>>
>>>>> Anyway, does anyone have an idea what would cause this, especially how
>>>>> the second job got shot into the system?  After the deadlock error, Bacula
>>>>> said it would reschedule the job.  However, the second job started right
>>>>> after the deadlock error instead of one hour later, which makes me think
>>>>> that there were two jobs for this host scheduled to run at 18:00.
>>>>>
>>>>> Thank you in advance,
>>>>> -craig
>>>>>
>>>>
>>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>>
>>>
>>>
>>> _______________________________________________
>>> Bacula-users mailing list
>>> Bacula-users@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/bacula-users
>>>
>>>
>>>
>>
>>
>>
>>
>
>
