I have a Bacula 15.0.2 instance running in a Rocky Linux 9 VM on a Synology
appliance. Backups are stored locally on the Synology appliance. Local
volumes are copied offsite to a Backblaze B2 S3-compatible endpoint using
copy jobs with selection type PoolUncopiedJobs. For ransomware protection,
object locking is enabled on the B2 account. Because of the focus on cloud
storage as a secondary backup destination, I have enforced 'jobs per volume
= 1' both locally and in the cloud. I have 3 cloud pools / buckets, for
Full, Diff, and Inc backups. Each has its own retention and object lock
period. Backblaze's file lifecycle settings are configured to keep only the
most recent copy of each file, with the retrospectively obvious exception
that it will respect the object lock period before removing previous copies
of a file. Remember that last fact, as it is going to be important later. :(

Recently, we've had some problems where old jobs were being copied into the
cloud repeatedly. With object locking enabled, you can imagine how
unpleasant that could become.

Here's my understanding of what was happening, based on my review of
bacula.log:
As we reached the end of retention for full backups (for the first time),
the older full backup volumes weren't being pruned. The old local job would
be deleted, but when the copy jobs launched they would detect those old
fulls as jobs that hadn't yet been copied (because the old copy job had
been pruned). The new copy job would run against the old volume (I'm
guessing this is how it worked, since the job records had certainly been
pruned), and the local volume for those old fulls would be copied
offsite... again. The recently copied job would then be pruned in turn
(because it inherited retention from its source job). Later, my copy jobs
would run again, and more copy jobs would spawn. My review of the
bacula.log file showed that this had been occurring sporadically since
August, with the issue rapidly escalating in September. In September, I
believe it was basically copying expired fulls, uploading them over the
course of a couple of days, pruning the freshly uploaded expired fulls, and
then launching new copy jobs that evening to begin the noxious cycle all
over again. Backblaze was allowing the deletion of truncated part files as
the cloud volumes were reused in the cycle, but it kept every past copy of
the part files that hadn't yet reached the end of its object lock period.
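
A query along these lines should show the repeated copies, if I have the
catalog schema right (untested sketch; assumes a PostgreSQL catalog named
"bacula", and that copies are stored as Type 'C' job records whose
PriorJobId points at the source job):

#!/bin/sh
# count-recopies.sh (sketch, untested) - how many copy records exist per
# source job? Anything with a count > 1 has been copied more than once.
# Assumes a PostgreSQL catalog database named "bacula" readable by this
# user, and that copy records are Type 'C' with PriorJobId set.
psql -A bacula -c "
  SELECT PriorJobId AS source_jobid, COUNT(*) AS copies
  FROM Job
  WHERE Type = 'C' AND PriorJobId <> 0
  GROUP BY PriorJobId
  HAVING COUNT(*) > 1
  ORDER BY copies DESC;"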

I previously ran a number of test full backups (in June, I believe), some
of which might not have been deleted, and some of which may have been
uploaded to the B2 cloud.

I suspect that some of those test jobs recently reached the end of their
retention and were pruned. The volumes weren't pruned and truncated, and
because I had more volumes than
numClients*numTimesExpectedToRunWithinRetentionPeriod, the excess volumes
weren't promptly truncated by Bacula for reuse in newer jobs. I'm sure some
of the excess volumes WERE truncated and reused, but not all of them. At
least one or two of the original 'full' backup volumes did remain, both in
Bacula's database and on disk. Those jobs were subsequently re-copied up to
B2 cloud storage multiple times.
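
Something like this should list volumes in that half-dead state, i.e.
Purged in the catalog but still holding data (untested sketch; again
assumes a PostgreSQL catalog, and the 1 MB threshold is arbitrary):

#!/bin/sh
# list-purged-but-fat.sh (sketch, untested) - volumes whose catalog records
# are Purged but whose recorded size suggests they were never truncated.
# Assumes a PostgreSQL catalog named "bacula"; the 1 MB cutoff is arbitrary.
psql -A bacula -c "
  SELECT VolumeName, VolStatus, VolBytes, LastWritten
  FROM Media
  WHERE VolStatus = 'Purged' AND VolBytes > 1000000
  ORDER BY LastWritten;"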

I've created scripts and admin jobs that prune all pools ('prune allpools
yes', then select '3' for Volumes) at job priority 1 on a daily schedule,
and then truncate all volumes for each storage ('truncate allpools
storage=$value') at priority 2, also daily. My goal is to ensure that no
copy job can run against an expired volume. I also want to remove expired
volumes from cloud storage as quickly as possible.
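
The scripts themselves are just thin bconsole wrappers, roughly like this
(paraphrased; the bconsole path assumes a standard /opt/bacula install, and
the B2 Diff/Inc storage names are stand-ins for the real ones):

#!/bin/sh
# prune-allpools.sh - prune expired records in every pool.
# 'prune allpools yes' prompts for what to prune; '3' answers "Volumes".
/opt/bacula/bin/bconsole <<'EOF'
prune allpools yes
3
EOF

#!/bin/sh
# truncate-volumes.sh - truncate purged volumes on each storage so they
# shrink to their minimum size on disk (and so cloud parts get deleted).
# Storage names other than Synology-Local and B2-TD-Full are placeholders.
for st in Synology-Local B2-TD-Full B2-TD-Diff B2-TD-Inc; do
  echo "truncate allpools storage=$st" | /opt/bacula/bin/bconsole
done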

I expect this has probably solved my problem, except that I have now
created a new problem: a task that runs before any of my backups with the
express purpose of destroying any expired backups. If backups haven't run
successfully for some time (which would be a colossal failure of
monitoring, to be sure), then it's possible for Bacula to merrily chew up
all the past backups, leaving us with nothing. As it stands, 3 months of
full backups should be extant before anything reaches its expiry period and
is purged. Still, I don't like having a hungry monster chewing up old
backups at the end of my backup trail, at least not without some sort of
context-sensitive monitoring. I don't know how a script could possibly
monitor the overall health of Bacula's backups and know whether or not it
is dangerous to proceed. I suspect it cannot.
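
The closest thing I can imagine is a crude guard in front of the prune job:
refuse to prune unless the catalog shows at least one backup that
terminated normally in the last few days. A rough, untested sketch
(assumes a PostgreSQL catalog and an arbitrary 3-day window):

#!/bin/sh
# prune-guard.sh (sketch, untested) - run in front of prune-allpools.sh.
# Bail out unless at least one backup job terminated normally ('T') in the
# last 3 days. Assumes a PostgreSQL catalog named "bacula".
recent=$(psql -At bacula -c \
  "SELECT COUNT(*) FROM Job
   WHERE Type = 'B' AND JobStatus = 'T'
     AND StartTime > now() - interval '3 days';")
if [ "${recent:-0}" -eq 0 ]; then
  echo "No successful backups in the last 3 days; refusing to prune." >&2
  exit 1
fi
exec /opt/bacula/etc/prune-allpools.sh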

Further remediation / prevention:
I have enabled data storage caps for our B2 cloud account. I can't set caps
where I want them because our current usage is higher than that level, but
I have set them to prevent substantial further growth. I've set a calendar
reminder in 3 months to lower the cap to 2x our usual usage.
With Backblaze B2, I cannot override the object lock unless I delete the
account. I am prepared to do that if the usage gets high enough, but I
think it's probably better to ride this out for the few months it would
take for these backups to fall off. The alternative means deleting the
account, recreating it, losing our cloud backups, and recopying all the
local volumes to the cloud; crucially, all the object lock periods would
then be wrong, because the locks would be just getting started while the
jobs were already partway through their retention periods.

My questions:
What are the best practices around the use of prune and truncate scripts
like those I've deployed?
Does my logic about the cause of all my issues seem reasonable?
Any ideas or tips to prevent annoying problems like this in the future?
Is my daily prune / truncate script a horrible, horrible idea that is going
to destroy everything I love?
Any particular way to handle this differently, perhaps more safely? It does
occur to me that I could delete every volume with status 'Purged', bringing
my volume count much closer to the number of volumes I usually need within
my retention period, thereby improving the odds that routine operation will
truncate any previously expired volumes without the need for a routine
truncate script. I'm not sure this would make any substantial difference
for my 'somehow we have failed to monitor Bacula and it has chewed its
entire tail off while we weren't looking' scenario.
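
For the record, that cleanup would amount to something like this (untested
sketch; as I understand it, 'delete volume' only removes catalog records,
so anything still on disk or in the bucket would need separate handling):

#!/bin/sh
# delete-purged-volumes.sh (sketch, untested) - remove catalog entries for
# every volume already marked Purged. As I understand it, 'delete volume'
# only touches the catalog; the files on disk / in the bucket remain.
# Assumes a PostgreSQL catalog named "bacula".
for vol in $(psql -At bacula -c \
    "SELECT VolumeName FROM Media WHERE VolStatus = 'Purged';"); do
  echo "delete volume=$vol yes" | /opt/bacula/bin/bconsole
done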

My bacula-dir.conf, selected portions:
Job {
  # If we don't remove expired volumes, we will fall into an infinite copy
  # loop where copy jobs keep running against random expired local jobs,
  # whose volumes should have been pruned already.
  # Step 1: prune, so volumes can be truncated.
  # Step 2: truncate, so volumes are reduced to their minimum size on disk.
  Name = "admin-prune-allpools-job"
  Type = admin
  Level = Full
  Schedule = "Daily"
  Storage = "None"
  Fileset = "None"
  Pool = "None"
  JobDefs = "Synology-Local"
  Runscript {
     RunsWhen = before
     RunsOnClient = no  # Default is yes; there is no client in an Admin job,
                        # and Admin Job RunScripts *only* run on the Director :)
     Command = "/opt/bacula/etc/prune-allpools.sh"  # This can be `Console`
                        # if you wish to send console commands, but don't use this
  }
  Priority = 1
}

Job {
  # If we don't remove expired volumes, we will fall into an infinite copy
  # loop where copy jobs keep running against random expired local jobs,
  # whose volumes should have been pruned already.
  # Step 1: prune, so volumes can be truncated.
  # Step 2: truncate, so volumes are reduced to their minimum size on disk.
  Name = "admin-truncate-volumes-job"
  Type = admin
  Level = Full
  Schedule = "Daily"
  Storage = "None"
  Fileset = "None"
  Pool = "None"
  JobDefs = "Synology-Local"
  Runscript {
     RunsWhen = before
     RunsOnClient = no  # Default is yes; there is no client in an Admin job,
                        # and Admin Job RunScripts *only* run on the Director :)
     Command = "/opt/bacula/etc/truncate-volumes.sh"  # This can be `Console`
                        # if you wish to send console commands, but don't use this
  }
  Priority = 2
}


Job {
  # Launching copy control jobs from a script so the 'PoolUncopiedJobs'
  # lookup will be fresh as of when the daily backups have completed.
  # Otherwise the jobs selected for copy would date from when the copy
  # control job was queued.
  Name = "admin-copy-control-launcher-job"
  Type = admin
  Level = Full
  Schedule = "Copy-End-Of-Day"
  Storage = "None"
  Fileset = "None"
  Pool = "None"
  JobDefs = "Synology-Local"
  Runscript {
     RunsWhen = before
     RunsOnClient = no  # Default is yes; there is no client in an Admin job,
                        # and Admin Job RunScripts *only* run on the Director :)
     Command = "/opt/bacula/etc/copy-control-launcher.sh"  # This can be `Console`
                        # if you wish to send console commands, but don't use this
  }
  Priority = 20
}
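
For completeness, copy-control-launcher.sh just kicks off the per-level
copy control jobs. Only the Full one is shown below, so the Diff and Inc
job names in this sketch are assumptions:

#!/bin/sh
# copy-control-launcher.sh - start the per-level copy control jobs now, so
# PoolUncopiedJobs is evaluated after tonight's backups have finished.
# Only Copy-Control-Full-job appears in the config below; the Diff and Inc
# job names are assumed to follow the same pattern.
for j in Copy-Control-Full-job Copy-Control-Diff-job Copy-Control-Inc-job; do
  echo "run job=$j yes" | /opt/bacula/bin/bconsole
done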

Job {
  # Launching a different copy control job for each backup level to prevent
  # copy control jobs with different pools from being cancelled as duplicates.
  Name = "Copy-Control-Full-job"
  Type = Copy
  Level = Full
  Client = td-bacula-fd
  Schedule = "Manual"
  FileSet = "None"
  Messages = Standard
  Pool = Synology-Local-Full
  Storage = "Synology-Local"
  Maximum Concurrent Jobs = 4
  Selection Type = PoolUncopiedJobs
  Priority = 21
  JobDefs = "Synology-Local"
}

Job {
  Name = "Backup-delegates-cad1-job"
  Level = "Incremental"
  Client = "delegates-cad1-fd"
  Fileset = "Windows-All-Drives-fs"
  Storage = "Synology-Local"
  Pool = "Synology-Local-Inc"
  JobDefs = "Synology-Local"
  Priority = 13
}

JobDefs {
  Name = "Synology-Local"
  Type = "Backup"
  Level = "Incremental"
  Messages = "Standard"
  # We don't want duplicate jobs. What action is taken is determined by the
  # directives below. See flowchart Figure 23.2 in the Bacula 15.x Main
  # manual, probably page 245 in the PDF.
  AllowDuplicateJobs = no
  # If a lower level job (example: Inc) is running or queued and a higher
  # level job (example: Diff or Full) is added to the queue, then the lower
  # level job will be cancelled.
  CancelLowerLevelDuplicates = yes
  CancelQueuedDuplicates = yes  # This will cancel any queued duplicate jobs.
  Pool = "Synology-Local-Inc"
  FullBackupPool = "Synology-Local-Full"
  IncrementalBackupPool = "Synology-Local-Inc"
  DifferentialBackupPool = "Synology-Local-Diff"
  Client = "td-bacula-fd"
  Fileset = "Windows-All-Drives-fs"
  Schedule = "Daily"
  WriteBootstrap = "/mnt/synology/bacula/BSR/%n.bsr"
  MaxFullInterval = 30days
  MaxDiffInterval = 7days
  SpoolAttributes = yes
  Priority = 10
  # If the previous Full or Diff failed, the current job will be upgraded to
  # match the failed job's level. A failed job is defined as one that has not
  # terminated normally, which includes any running job of the same name, so
  # duplicate queued jobs cannot be allowed. This will also trigger on fileset
  # changes, regardless of whether you used 'ignorefilesetchanges'.
  ReRunFailedLevels = yes
  RescheduleOnError = no
  Accurate = yes
}


Schedule {
  Name = "Daily"
  Run = at 20:05
}
Schedule {
  Name = "Copy-End-Of-Day"
  Run = at 23:51
}

Pool {
  Name = "Synology-Local-Full"
  Description = "Synology-Local-Full"
  PoolType = "Backup"
  NextPool = "B2-TD-Full"
  LabelFormat = "Synology-Local-Full-"
  LabelType = "Bacula"
  MaximumVolumeJobs = 1
#  MaximumVolumeBytes = 50G
#  MaximumVolumes = 100
  Storage = "Synology-Local"
  ActionOnPurge=Truncate
  FileRetention = 95days
  JobRetention = 95days
  VolumeRetention = 95days
}

Pool {
  Name = "B2-TD-Full"
  Description = "B2-TD-Full"
  PoolType = "Backup"
  LabelFormat = "B2-TD-Full-"
  LabelType = "Bacula"
  MaximumVolumeJobs = 1
  FileRetention = 95days
  JobRetention = 95days
  VolumeRetention = 95days
  Storage = "B2-TD-Full"
  CacheRetention = 1minute
  ActionOnPurge=Truncate
}

Regards,
Robert Gerber
402-237-8692
[email protected]