Re: [Bacula-devel] Copy job unexpectedly failed because of ... timing bug?

Heitor Faria Tue, 03 Dec 2019 04:48:21 -0800

Hello Phil,

This feature was already requested by one of my customers in Brazil and 
deployed by Bacula Systems:


Job Resource directive:

"Run = <job-name> The Run directive (not to be confused with the Run option in 
a Schedule) allows you to start other jobs or to clone jobs. By using the 
cloning keywords (see below), you can backup the same data (or almost the same 
data) to two or more drives at the same time. The <job-name> is normally the 
same name as the current Job resource (thus creating a clone). However, it may 
be any Job name, so one job may start other related jobs..."

E.g.: run = "Nightly-backup level=%l since=\"%s\" storage=DDS-4" 

Not sure if already in Community.

Regards,
--
MSc Heitor Faria
CEO Bacula LatAm
mobile1: + 1 909 655-8971
mobile2: + 55 61 98268-4220

América Latina
[ http://bacula.lat/]

-------- Original Message --------
From: Phil Stracchino <ph...@caerllewys.net>
Sent: Monday, December 2, 2019 02:09 PM
To: bacula-devel <bacula-devel@lists.sourceforge.net>
Subject: [Bacula-devel] Copy job unexpectedly failed because of ... timing bug?

>So, I'm now doing disk-to-disk-to-disk backup, with the third tier being
>a rotating ZFS pool of removable SATA disks.  My full backups ran this
>morning, and all of the backup jobs completed successfully without
>errors, but the archive copy job failed without reporting any errors.
>
>Here's the job definition:
>
>Job {
>  Name = "Archive Copy"
>  Type = Copy
>  Enabled = Yes
>  Pool = Full-Disk
>  Level = Full
>  Client = ALL
>  Fileset = DUMMY
>  Storage = asgard-archive
>  Selection Type = SQL Query
>  Selection Pattern = "SELECT DISTINCT J.JobId, J.StartTime
>                       FROM Job J JOIN Pool P ON P.PoolId = J.PoolId
>                       WHERE P.Name = 'Full-Disk'
>                       AND J.Type = 'B'
>                       AND J.JobStatus IN ('T','W')
>                       AND J.jobBytes > 0
>                       AND J.StartTime > now() - interval 24 hour"
>  Messages = Daemon
>  Priority = 20
>  Rerun Failed Levels = yes
>  Allow Duplicate Jobs = no
>  Cancel Queued Duplicates = yes
>  Schedule = "Full Backup Archive"
>}
>
>
>And here's what that selection pattern query returns if I run it
>manually, right now:
>
>MySQL 127.0.0.1> SELECT DISTINCT J.JobId, J.StartTime FROM Job J JOIN
>Pool P ON P.PoolId = J.PoolId WHERE P.Name = 'Full-Disk' AND J.Type =
>'B' AND J.JobStatus IN ('T','W') AND J.jobBytes > 0 AND J.StartTime >
>now() - interval 24 hour;
>+-------+---------------------+
>| JobId | StartTime           |
>+-------+---------------------+
>| 23476 | 2019-12-02 04:30:03 |
>| 23477 | 2019-12-02 04:30:03 |
>| 23478 | 2019-12-02 04:30:03 |
>| 23479 | 2019-12-02 04:30:03 |
>| 23480 | 2019-12-02 04:30:03 |
>| 23481 | 2019-12-02 04:30:04 |
>| 23482 | 2019-12-02 11:40:57 |
>+-------+---------------------+
>7 rows in set (0.001 sec)
>
>
>And here's the job output:
>
>
>02-Dec 05:00 minbar-dir JobId 23483: No JobIds found to copy.
>02-Dec 11:42 minbar-dir JobId 23483: Bacula minbar-dir 9.4.4 (28May19):
>  Build OS:               x86_64-pc-linux-gnu gentoo
>  Prev Backup JobId:      0
>  Prev Backup Job:
>  New Backup JobId:       0
>  Current JobId:          23483
>  Current Job:            Archive_Copy.2019-12-02_05.00.00_46
>  Backup Level:           Full
>  Client:                 ALL
>  FileSet:                "DUMMY" 2019-11-04 12:00:40
>  Read Pool:              "Full-Disk" (From Job resource)
>  Read Storage:           "asgard-archive" (From Job resource)
>  Write Pool:             "Full-Archive" (From Pool's NextPool resource)
>  Write Storage:          "asgard-archive" (From Pool's NextPool resource)
>  Catalog:                "Catalog" (From Client resource)
>  Start time:             02-Dec-2019 11:42:38
>  End time:               02-Dec-2019 11:42:38
>  Elapsed time:           1 sec
>  Priority:               20
>  SD Files Written:       0
>  SD Bytes Written:       0 (0 B)
>  Rate:                   0.0 KB/s
>  Volume name(s):
>  Volume Session Id:      0
>  Volume Session Time:    0
>  Last Volume Bytes:      0 (0 B)
>  SD Errors:              0
>  SD termination status:
>  Termination:            Copying -- no files to copy
>
>
>When I tried to rerun the Archive Copy job manually from BAT, the
>Director crashed.  When I restarted the Director and manually ran it
>AGAIN, this time it appears to be starting and running exactly as it
>should, the console reporting:
>
>run job="Archive Copy" fileset="DUMMY" level="Full" client="ALL"
>pool="Full-Disk" storage="asgard-file" priority="10" when="2019-12-02
>13:05:49" yes
>Job queued. JobId=23484
>02-Dec 13:05 minbar-dir JobId 23484: The following 7 JobIds were chosen
>to be copied: 23476,23477,23478,23479,23480,23481,23482
>02-Dec 13:05 minbar-dir JobId 23485: Copying using JobId=23476
>Job=Asgard_Backup.2019-12-02_04.30.00_39
>02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23485
>02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23485 started.
>02-Dec 13:05 minbar-dir JobId 23487: Copying using JobId=23477
>Job=Netstore_Backup.2019-12-02_04.30.00_40
>02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23487
>02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23487 started.
>02-Dec 13:05 minbar-dir JobId 23485: Start Copying JobId 23485,
>Job=Archive_Copy.2019-12-02_13.05.53_05
>02-Dec 13:05 minbar-dir JobId 23489: Copying using JobId=23478
>Job=Babylon5_Backup.2019-12-02_04.30.00_41
>02-Dec 13:05 minbar-dir JobId 23485: Using Device "FileStorage" to read.
>02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23489
>02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23489 started.
>02-Dec 13:05 minbar-dir JobId 23491: Copying using JobId=23479
>Job=Jumpgate_Backup.2019-12-02_04.30.00_42
>02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23491
>02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23491 started.
>02-Dec 13:05 minbar-dir JobId 23493: Copying using JobId=23480
>Job=Minbar_Backup.2019-12-02_04.30.00_43
>02-Dec 13:05 minbar-dir JobId 23486: Max configured use duration=82,800
>sec. exceeded. Marking Volume "ARCHIVE-20191104-12:00" as Used.
>02-Dec 13:05 minbar-dir JobId 23486: Created new
>Volume="ARCHIVE-20191202-13:05", Pool="Full-Archive", MediaType="File"
>in catalog.
>02-Dec 13:05 minbar-dir JobId 23486: Using Device "ArchiveCopy" to write.
>02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23493
>02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23493 started.
>02-Dec 13:05 minbar-dir JobId 23495: Copying using JobId=23481
>Job=Narn_Backup.2019-12-02_04.30.00_44
>02-Dec 13:05 asgard-sd JobId 23486: Labeled new Volume
>"ARCHIVE-20191202-13:05" on File device "ArchiveCopy" (/arcpool).
>02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23495
>02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23495 started.
>02-Dec 13:05 minbar-dir JobId 23484: Copying using JobId=23482
>Job=MySQL_Backup_New.2019-12-02_04.55.00_45
>02-Dec 13:05 asgard-sd JobId 23486: Wrote label to prelabeled Volume
>"ARCHIVE-20191202-13:05" on File device "ArchiveCopy" (/arcpool)
>02-Dec 13:05 asgard-sd JobId 23485: Ready to read from volume
>"FULL-20191202-04:30" on File device "FileStorage" (/spool/bacula).
>
>
>Now, what this LOOKS like to me is that the Director is evaluating the
>selection function at the time that the job is SCHEDULED, NOT at the
>time that it actually RUNS, at which time none of the jobs it is
>supposed to archive have actually completed yet, and therefore it
>doesn't find anything to archive.  Which seems to me to mean that I have
>to guess how long all the other jobs are going to take to run and not
>SCHEDULE the job until after they have all completed.  Which is going to
>be a problem if the jobs to be archived take longer than expected to
>run, for any reason.
>
>(The Director crash is a separate problem on which I have still gotten
>precisely nowhere.)
>
>
>It seems to me like this is a bug, and the selection criteria for a COPY
>job should not be resolved until the COPY job is actually started.
>
>
>Any comments...?
>
>
>-- 
>  Phil Stracchino
>  Babylon Communications
>  ph...@caerllewys.net
>  p...@co.ordinate.org
>  Landline: +1.603.293.8485
>  Mobile:   +1.603.998.6958
>
>
>_______________________________________________
>Bacula-devel mailing list
>Bacula-devel@lists.sourceforge.net
>https://lists.sourceforge.net/lists/listinfo/bacula-devel

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Re: [Bacula-devel] Copy job unexpectedly failed because of ... timing bug?

Reply via email to