Hi Ana,

You are absolutely correct!  I didn't even notice that.  Looking at the
log, that appears to be the case.
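
For anyone finding this thread in the archive, here is a rough sketch of
how the settings discussed below might look in the Director's Job
resource. It is not the actual configuration from this setup: the Job name
is the same placeholder used in the logs, "Reschedule On Error = yes" is
an assumption (the log shows the job being rescheduled after an error),
and every other required directive is left out.

```
# Hypothetical sketch of a bacula-dir.conf Job resource; not the real config.
Job {
  Name = "<host>"                  # placeholder, as in the log excerpts
  Type = Backup
  # Client, FileSet, Schedule, Storage, Pool and Messages omitted here.

  Reschedule On Error = yes        # assumption: needed for any reschedule
  Reschedule Interval = 1 hour     # the "re-run in 3600 seconds" in the log
  Reschedule Times = 3             # the "Retry set to 3" from the first mail

  Allow Duplicate Jobs = no        # matches "Duplicate job not allowed"
}
```

With duplicates disallowed, a second job of the same name that starts
while the first is still active is refused, which is the "Duplicate job
not allowed" error that JobId 118594 reported.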

Thank you so much!
-craig


On Fri, Jul 24, 2015 at 3:23 AM, Ana Emília M. Arruda <
emiliaarr...@gmail.com> wrote:

> Hello Craig,
>
> It seems you have this job scheduled every day at about 18:00. Am I
> correct? If so, I think the job scheduled to run on 2015-07-20 (maybe
> at 18:00) started at 18:01:39 but did not finish until 2015-07-21
> 04:49:42. So JobId 118594 is the same job, scheduled for the next day
> (2015-07-21 at 18:00), and it had been waiting for JobId 118277 to
> finish (since you have not allowed duplicate jobs). Since job 118277 was
> canceled, no more retries would be carried out.
>
> Best regards,
> Ana
>
> On Fri, Jul 24, 2015 at 1:41 AM, Craig Shiroma <shiroma.crai...@gmail.com>
> wrote:
>
>> Hi Ana,
>>
>> >Sorry for the delay.
>>
>> No apologies necessary!  Thank you for the help.  I appreciate it.
>>
>> >Don't you have log lines for the retry that should occur at 21-Jul-2015
>> 05:49?
>>
>> I see no entry in the log file indicating job 118277 was restarted.
>>
>> >Have you manually canceled the JobId 118277?
>>
>> No, no one manually cancelled job 118277.  I'm guessing Bacula did
>> somehow.
>>
>> Would you have any idea what would cause job 118594 to be started?  Up
>> to now, I thought it was the restarted job under a new jobId.
>>
>> The above situation occurs occasionally (not frequently) for other host
>> backups.  I have not been able to figure out why.
>>
>> This is what shows up if I do a 'list files job=<host>':
>>
>> <snipped>
>> | 117,865 | <host> | 2015-07-19 18:01:35 | B    | I     |        70 |      13,864,818 | T         |
>> | 118,277 | <host> | 2015-07-20 18:01:39 | B    | F     | 1,631,292 | 311,164,701,221 | f         |
>> | 118,594 | <host> | 2015-07-21 04:49:57 | B    | F     |         0 |               0 | A         |
>> <snipped>
>>
>> I did run a test on our test Bacula environment to simulate what
>> happened.  I ran a backup and, while it was running, shut down the
>> bacula-sd daemon on the storage host.  The exact same thing happened: I
>> got the same (or a similar) network error and the same 'job will be
>> restarted in 3600 seconds' message, another job started right away with a
>> different jobId, and the original job was cancelled.  No restart with the
>> same jobId occurred 1 hour later.
>>
>> Many thanks,
>> -craig
>>
>> On Thu, Jul 23, 2015 at 5:21 PM, Ana Emília M. Arruda <
>> emiliaarr...@gmail.com> wrote:
>>
>>> Hello Craig,
>>>
>>> Sorry for the delay.
>>>
>>> The second email does not seem to be a retry. All retries of a job
>>> have the same JobId. That is, the same JobId will be retried up to the
>>> "reschedule times" you configured for the Job.
>>>
>>> On Wed, Jul 22, 2015 at 4:49 AM, Craig Shiroma <
>>> shiroma.crai...@gmail.com> wrote:
>>>
>>>> Hi Ana,
>>>>
>>>> Thank you for the reply.
>>>>
>>>> Sorry, yes that is what I meant.
>>>>
>>>> I'm wondering if Bacula canceled the job due to the Fatal error and the
>>>> below line included in the first "bacula error" email actually shouldn't be
>>>> there (i.e. erroneous) because no retry was made at 05:49 (3600 seconds
>>>> later).  However, the second "bacula error" email seems to indicate a retry
>>>> was actually attempted although at the wrong time (2015-07-21 04:49:57).
>>>>
>>>>
>>>> 2015-07-21 04:49:57 bacula-dir JobId 118277: Rescheduled Job
>>>> <host>.2015-07-20_18.00.03_31 at 21-Jul-2015 04:49 to re-run in 3600
>>>> seconds (21-Jul-2015 05:49).
>>>>
>>>> Thanks again!
>>>> -craig
>>>>
>>>>
>>>>
>>>> On Tue, Jul 21, 2015 at 4:55 PM, Ana Emília M. Arruda <
>>>> emiliaarr...@gmail.com> wrote:
>>>>
>>>>> Hello Craig,
>>>>>
>>>>> Do you mean reschedule interval = 1 hour and reschedule times = 3?
>>>>>
>>>>> Best regards,
>>>>> Ana
>>>>> On Tue, Jul 21, 2015 at 19:29, Craig Shiroma <
>>>>> shiroma.crai...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I had a backup that failed this morning due to what looks like a
>>>>>> network problem.
>>>>>>
>>>>>> First error email:
>>>>>> 2015-07-21 04:49:42 <host> JobId 118277: Error: lib/bsock.c:693 Write
>>>>>> error sending 65813 bytes to Storage daemon:<host>:9103:
>>>>>> ERR=Input/output error
>>>>>> 2015-07-21 04:49:42 <host> JobId 118277: Fatal error:
>>>>>> filed/backup.c:1282 Network send error to SD. ERR=Input/output error
>>>>>> ...
>>>>>> FD termination status:  Error
>>>>>> SD termination status:  Error
>>>>>> Termination:            *** Backup Error ***
>>>>>>
>>>>>> 2015-07-21 04:49:57 bacula-dir JobId 118277: Rescheduled Job
>>>>>> <host>.2015-07-20_18.00.03_31 at 21-Jul-2015 04:49 to re-run in 3600
>>>>>> seconds (21-Jul-2015 05:49).
>>>>>>
>>>>>
>>> This means that your Job (scheduled to run at 18:00 2015-07-20) was
>>> rescheduled to run at 21-Jul-2015 05:49 (1 hour after 21-Jul-2015
>>> 04:49, the time it was rescheduled).
>>>
>>>
>>>>
>>>>>>
>>>>>> I have Retry set to 3 and retry interval set to 1 hour.  However, the
>>>>>> retry was canceled as seen below.  I'm wondering how I should interpret
>>>>>> the below messages.  Did Bacula notice the job was still running when it
>>>>>> tried to reschedule the job and then cancel it right then (practically
>>>>>> at the same time the above error occurred)?  Or, did it actually try to
>>>>>> re-run the job right after the error instead of one hour later (5:49)
>>>>>> and cancel the job?  If so, why did it try to re-run the job right after
>>>>>> the error instead of one hour later?
>>>>>>
>>>>>> Second error email:
>>>>>> 2015-07-21 04:49:57 bacula-dir JobId 118594: Fatal error: JobId 118277
>>>>>> already running. Duplicate job not allowed.
>>>>>>
>>>>>
>>> It seems that a duplicate job (JobId 118594) started while the first
>>> one (JobId 118277) was still running.
>>>
>>>
>>>>
>>>>>> 2015-07-21 04:49:57 bacula-dir JobId 118594: Bacula bacula-dir 7.0.5
>>>>>> (28Jul14):
>>>>>>
>>>>>> FD termination status:
>>>>>> SD termination status:
>>>>>> Termination:            Backup Canceled
>>>>>>
>>>>>>
>>> The duplicate job was canceled. Don't you have log lines for the retry
>>> that should occur at 21-Jul-2015 05:49? Have you manually canceled the
>>> JobId 118277? If so, no retry would be carried out.
>>>
>>>
>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Craig
>>>>>>
>>>>>
>>> Best regards,
>>> Ana
>>>
>>>
>>>
>>>>
>>>>>> _______________________________________________
>>>>>> Bacula-users mailing list
>>>>>> Bacula-users@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/bacula-users
>>>>>>
>>>>>
>>>>
>>>
>>
>