[Bacula-devel] Copy job unexpectedly failed because of ... timing bug?

Phil Stracchino Mon, 02 Dec 2019 11:06:05 -0800

So, I'm now doing disk-to-disk-to-disk backup, with the third tier being
a rotating ZFS pool of removable SATA disks.  My full backups ran this
morning, and all of the backup jobs completed successfully without
errors, but the archive copy job failed without reporting any errors.


Here's the job definition:

Job {
  Name = "Archive Copy"
  Type = Copy
  Enabled = Yes
  Pool = Full-Disk
  Level = Full
  Client = ALL
  Fileset = DUMMY
  Storage = asgard-archive
  Selection Type = SQL Query
  Selection Pattern = "SELECT DISTINCT J.JobId, J.StartTime
                       FROM Job J JOIN Pool P ON P.PoolId = J.PoolId
                       WHERE P.Name = 'Full-Disk'
                       AND J.Type = 'B'
                       AND J.JobStatus IN ('T','W')
                       AND J.jobBytes > 0
                       AND J.StartTime > now() - interval 24 hour"
  Messages = Daemon
  Priority = 20
  Rerun Failed Levels = yes
  Allow Duplicate Jobs = no
  Cancel Queued Duplicates = yes
  Schedule = "Full Backup Archive"
}


And here's what that selection pattern query returns if I run it
manually, right now:

MySQL 127.0.0.1> SELECT DISTINCT J.JobId, J.StartTime FROM Job J JOIN
Pool P ON P.PoolId = J.PoolId WHERE P.Name = 'Full-Disk' AND J.Type =
'B' AND J.JobStatus IN ('T','W') AND J.jobBytes > 0 AND J.StartTime >
now() - interval 24 hour;
+-------+---------------------+
| JobId | StartTime           |
+-------+---------------------+
| 23476 | 2019-12-02 04:30:03 |
| 23477 | 2019-12-02 04:30:03 |
| 23478 | 2019-12-02 04:30:03 |
| 23479 | 2019-12-02 04:30:03 |
| 23480 | 2019-12-02 04:30:03 |
| 23481 | 2019-12-02 04:30:04 |
| 23482 | 2019-12-02 11:40:57 |
+-------+---------------------+
7 rows in set (0.001 sec)


And here's the job output:


02-Dec 05:00 minbar-dir JobId 23483: No JobIds found to copy.
02-Dec 11:42 minbar-dir JobId 23483: Bacula minbar-dir 9.4.4 (28May19):
  Build OS:               x86_64-pc-linux-gnu gentoo
  Prev Backup JobId:      0
  Prev Backup Job:
  New Backup JobId:       0
  Current JobId:          23483
  Current Job:            Archive_Copy.2019-12-02_05.00.00_46
  Backup Level:           Full
  Client:                 ALL
  FileSet:                "DUMMY" 2019-11-04 12:00:40
  Read Pool:              "Full-Disk" (From Job resource)
  Read Storage:           "asgard-archive" (From Job resource)
  Write Pool:             "Full-Archive" (From Pool's NextPool resource)
  Write Storage:          "asgard-archive" (From Pool's NextPool resource)
  Catalog:                "Catalog" (From Client resource)
  Start time:             02-Dec-2019 11:42:38
  End time:               02-Dec-2019 11:42:38
  Elapsed time:           1 sec
  Priority:               20
  SD Files Written:       0
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Volume name(s):
  Volume Session Id:      0
  Volume Session Time:    0
  Last Volume Bytes:      0 (0 B)
  SD Errors:              0
  SD termination status:
  Termination:            Copying -- no files to copy


When I tried to rerun the Archive Copy job manually from BAT, the
Director crashed.  When I restarted the Director and manually ran it
AGAIN, this time it appears to be starting and running exactly as it
should, the console reporting:

run job="Archive Copy" fileset="DUMMY" level="Full" client="ALL"
pool="Full-Disk" storage="asgard-file" priority="10" when="2019-12-02
13:05:49" yes
Job queued. JobId=23484
02-Dec 13:05 minbar-dir JobId 23484: The following 7 JobIds were chosen
to be copied: 23476,23477,23478,23479,23480,23481,23482
02-Dec 13:05 minbar-dir JobId 23485: Copying using JobId=23476
Job=Asgard_Backup.2019-12-02_04.30.00_39
02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23485
02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23485 started.
02-Dec 13:05 minbar-dir JobId 23487: Copying using JobId=23477
Job=Netstore_Backup.2019-12-02_04.30.00_40
02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23487
02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23487 started.
02-Dec 13:05 minbar-dir JobId 23485: Start Copying JobId 23485,
Job=Archive_Copy.2019-12-02_13.05.53_05
02-Dec 13:05 minbar-dir JobId 23489: Copying using JobId=23478
Job=Babylon5_Backup.2019-12-02_04.30.00_41
02-Dec 13:05 minbar-dir JobId 23485: Using Device "FileStorage" to read.
02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23489
02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23489 started.
02-Dec 13:05 minbar-dir JobId 23491: Copying using JobId=23479
Job=Jumpgate_Backup.2019-12-02_04.30.00_42
02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23491
02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23491 started.
02-Dec 13:05 minbar-dir JobId 23493: Copying using JobId=23480
Job=Minbar_Backup.2019-12-02_04.30.00_43
02-Dec 13:05 minbar-dir JobId 23486: Max configured use duration=82,800
sec. exceeded. Marking Volume "ARCHIVE-20191104-12:00" as Used.
02-Dec 13:05 minbar-dir JobId 23486: Created new
Volume="ARCHIVE-20191202-13:05", Pool="Full-Archive", MediaType="File"
in catalog.
02-Dec 13:05 minbar-dir JobId 23486: Using Device "ArchiveCopy" to write.
02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23493
02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23493 started.
02-Dec 13:05 minbar-dir JobId 23495: Copying using JobId=23481
Job=Narn_Backup.2019-12-02_04.30.00_44
02-Dec 13:05 asgard-sd JobId 23486: Labeled new Volume
"ARCHIVE-20191202-13:05" on File device "ArchiveCopy" (/arcpool).
02-Dec 13:05 minbar-dir JobId 23484: Job queued. JobId=23495
02-Dec 13:05 minbar-dir JobId 23484: Copying JobId 23495 started.
02-Dec 13:05 minbar-dir JobId 23484: Copying using JobId=23482
Job=MySQL_Backup_New.2019-12-02_04.55.00_45
02-Dec 13:05 asgard-sd JobId 23486: Wrote label to prelabeled Volume
"ARCHIVE-20191202-13:05" on File device "ArchiveCopy" (/arcpool)
02-Dec 13:05 asgard-sd JobId 23485: Ready to read from volume
"FULL-20191202-04:30" on File device "FileStorage" (/spool/bacula).


Now, what this LOOKS like to me is that the Director is evaluating the
selection function at the time that the job is SCHEDULED, NOT at the
time that it actually RUNS, at which time none of the jobs it is
supposed to archive have actually completed yet, and therefore it
doesn't find anything to archive.  Which seems to me to mean that I have
to guess how long all the other jobs are going to take to run and not
SCHEDULE the job until after they have all completed.  Which is going to
be a problem if the jobs to be archived take longer than expected to
run, for any reason.

(The Director crash is a separate problem on which I have still gotten
precisely nowhere.)


It seems to me like this is a bug, and the selection criteria for a COPY
job should not be resolved until the COPY job is actually started.


Any comments...?


-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958


_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel

[Bacula-devel] Copy job unexpectedly failed because of ... timing bug?

Reply via email to