**Problem Description**

I'm trying to get the following configuration to work (the real one is more 
complex, but this is a minimal example, though still quite long): 

There are three hosts, all running Linux. 

Host Alice: runs bacula-fd, sending local data to Bob.
Host Bob: runs bacula-fd, bacula-sd, and bacula-dir; its storage points to a NAS.
Host Carl: runs bacula-sd, pointing to a local ZFS filesystem. 


I'm trying to get the daily backups from host 'Alice' to host 'Bob' copied so 
they are also stored on host 'Carl' for redundancy. From what I've found, 
Bacula has a [copy job][1] feature made for exactly this. 

I cannot get these to work. Instead of copying files, these copy jobs cause the 
job (and everything that depends on it) to deadlock. 

**Configuration**

The definitions of the backup job and the copy jobs, stored on host Bob at 
`Bob:/etc/bacula/jobdefs/alice.conf`, look like the following: 


Job {
  Name = "alice-job"
  Client = "alice-fd"
  FileSet = alice-fileset
  Pool = Default
  Full Backup Pool = alice-full-pool
  Incremental Backup Pool = alice-inc-pool
  Differential Backup Pool = alice-diff-pool
  Enabled = Yes
  Type = Backup
  Schedule = WeeklyCycle
  Messages = Standard
  Storage = sdef-bob

  RunScript {
    RunsOnClient = No
    Command = "/etc/bacula/runCopyJob.sh -n %n -i %i -p %p -l %l"
    RunsWhen = After
  }
  Maximum Concurrent Jobs = 63 ## Testing various values here.
}

# Supposed to copy the Full backups originally from Alice from Bob to Carl
# Not working yet...
Job {
  Name = "alice-copy-full"
  Client = "bob-fd"
  FileSet = alice-fileset
  Type = Copy
  Pool = alice-full-pool
         # Copies all jobs that have not yet been copied.
  Selection Type = PoolUncopiedJobs 
  Messages = Standard
  Enabled = Yes
         # Note: This is the source, not the target storage
  Storage = bob-storage
  Maximum Concurrent Jobs = 19 ## Trying out Pool jobs x 2 + 1
}

Job {
  Name = "alice-copy-inc"
  Client = "bob-fd"
  FileSet = alice-fileset
  Type = Copy
  Pool = alice-inc-pool
  Selection Type = PoolUncopiedJobs 
  Messages = Standard
  Enabled = Yes
  Storage = bob-storage 
  Maximum Concurrent Jobs = 63
}
Job {
  Name = "alice-copy-diff"
  Client = "bob-fd"
  FileSet = alice-fileset
  Type = Copy
  Pool = alice-diff-pool
  Selection Type = PoolUncopiedJobs 
  Messages = Standard
  Enabled = Yes
  Storage = bob-storage # Note: This is the source, not the target storage
  Maximum Concurrent Jobs = 63
}



I use a small bash script, `runCopyJob.sh`, that manually triggers the copy job 
after the regular job completes. This does the same as running the job manually 
via the console (if the job is #23 in the list, you would start `bconsole`, 
type `run`, then select `23`).
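For reference, a minimal sketch of what `runCopyJob.sh` does. The function names and the level-to-job mapping are my own illustration of the idea, not the actual script:


#!/bin/sh
# Sketch of runCopyJob.sh: queue the matching copy job after a backup.

# map_level_to_copyjob LEVEL: translate Bacula's %l substitution
# (Full/Incremental/Differential) into the matching copy-job name.
map_level_to_copyjob() {
  case "$1" in
    Full)         echo "alice-copy-full" ;;
    Incremental)  echo "alice-copy-inc" ;;
    Differential) echo "alice-copy-diff" ;;
    *)            return 1 ;;
  esac
}

# queue_copy_job LEVEL: ask the director, via bconsole, to start the
# copy job non-interactively ('yes' skips the confirmation prompt).
queue_copy_job() {
  copyjob=$(map_level_to_copyjob "$1") || return 1
  printf 'run job=%s yes\n' "$copyjob" | bconsole
}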

The `WeeklyCycle` schedule does a Full backup each month on the 1st, then 
incremental/differential backups daily on weekdays. 

Side note: `63 = 31 * 2 + 1`. The longest month has 31 days, each copy task 
spawns two jobs, and the main task is one more job. So with at most daily 
backups and the copy-job functionality enabled for a client, there are at most 
63 jobs queued by the 31st. 

I've been experimenting with the Maximum Concurrent Jobs value, but to no 
avail. It can be set in many places, and I'm not sure which overrides which; 
trial and error across the endless possible combinations is taking too long.
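One way to stop guessing at the precedence is to ask the director what it actually parsed: bconsole's `show` command dumps the resource records. This is a sketch; the `MaxJobs=` field name in the dump output is an assumption about this Bacula version:


# show_max_jobs RESOURCE: dump the named resource class ('jobs',
# 'storages', 'clients', ...) from the director and keep only the
# lines that carry names and parsed concurrency limits.
show_max_jobs() {
  printf 'show %s\n' "$1" | bconsole | grep -iE 'name=|maxjobs'
}

# Example:
# show_max_jobs jobs
# show_max_jobs storages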

Next, there are also some backup pools defined for Alice (six in total). These 
are defined on Bob under `Bob:/etc/bacula/pooldefs/`: 


Pool {
  Name = alice-full-pool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 6 months
  Maximum Volume Jobs = 1
  Maximum Volumes = 9
  Label Format = alice-full-
  Next Pool = alice-full-pool-carl
}

Pool {
  Name = alice-inc-pool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 20 days
  Maximum Volume Jobs = 6
  Maximum Volumes = 7
  Label Format = alice-inc-
  Next Pool = alice-inc-pool-carl
}

Pool {
  Name = alice-diff-pool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 40 days
  Maximum Volume Jobs = 1
  Maximum Volumes = 10
  Label Format = alice-diff-
  Next Pool = alice-diff-pool-carl
}

Pool {
  Name = alice-full-pool-carl
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 6 months
  Maximum Volume Jobs = 1
  Maximum Volumes = 9
  Label Format = alice-full-
  Storage = sdef-carl
}

Pool {
  Name = alice-inc-pool-carl
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 20 days
  Maximum Volume Jobs = 6
  Maximum Volumes = 7
  Label Format = alice-inc-
  Storage = sdef-carl
}

Pool {
  Name = alice-diff-pool-carl
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 40 days
  Maximum Volume Jobs = 1
  Maximum Volumes = 10
  Label Format = alice-diff-
  Storage = sdef-carl
}


The director (on Bob) is configured via `Bob:/etc/bacula/bacula-dir.conf` like 
so: 


Director {
  Name = bob-dir
  DIRport = 9101
  QueryFile = "/etc/bacula/query.sql"
  WorkingDirectory = "/opt/bacula/working"
  PidDirectory = "/var/run"
  Maximum Concurrent Jobs = 63 ## Trying out higher values here...
  Password = "dir-bob-password"
  Messages = Daemon
  TLS Certificate = /etc/bacula/certs/Bob.example.com.crt
  TLS Key = /etc/bacula/certs/Bob.example.com.key
  TLS CA Certificate File= /etc/bacula/certs/myca.crt
  TLS Enable = yes
  TLS Require = yes
  TLS Verify Peer = yes
  TLS Allowed CN = bob.example.com
}
Schedule {
  Name = "WeeklyCycle"
  Run = Full 1st sun at 23:05
  Run = Differential 2nd-5th sun at 23:05
  Run = Incremental mon-sat at 23:05
}


The storage daemon on Bob is configured via `Bob:/etc/bacula/bacula-sd.conf`: 


Storage {
  Name = bob-sd
  SDPort = 9103                  # Port the SD listens on
  WorkingDirectory = "/opt/bacula/working"
  Pid Directory = "/var/run"
  Maximum Concurrent Jobs = 40
  SDAddress = bob.example.com
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Enable = yes
  TLS Require = yes
  TLS Verify Peer = yes
  TLS Allowed CN = bob.example.com, carl.example.com

}

Director {
  Name = bob-dir
  Password = "sd-bob-dir-bob-pw"
}

Device {
  Name = bob-storage
  Media Type = File
  Archive Device = /FileStorage01 # Path where the NAS is mounted. 
  LabelMedia = yes;
  Random Access = Yes; 
  AutomaticMount = yes;
  RemovableMedia = no;
  AlwaysOpen = no;
  Maximum Concurrent Jobs = 20
}
 

The storage daemon on Carl is configured via the same file path on that host: 



Storage {                             # definition of myself
  Name = carl-sd
  SDPort = 9103                  # Port the SD listens on
  WorkingDirectory = "/var/lib/bacula"
  Pid Directory = "/run/bacula"
  Plugin Directory = "/usr/lib/bacula"
  Maximum Concurrent Jobs = 63
  SDAddress = 0.0.0.0

  TLS Enable = Yes
  TLS Require = Yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/carl.example.com.crt
  TLS Key = /etc/bacula/certs/carl.example.com.key

  TLS Verify Peer = Yes
  TLS Allowed CN = bob.example.com
}

#
# List Directors who are permitted to contact Storage daemon
#
Director {
  Name = bob-dir
  Password = "sd-carl-dir-bob-pw"
  TLS Enable = Yes
  TLS Require = Yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/carl.example.com.crt
  TLS Key = /etc/bacula/certs/carl.example.com.key
}

Device {
  Name = carl-storage
  Media Type = File
  Archive Device = /zfs1/external/bob/backups
  LabelMedia = yes;                   # lets Bacula label unlabeled media
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  RemovableMedia = no;
  AlwaysOpen = no;
  Maximum Concurrent Jobs = 20
}


Both storage daemons are declared in `/etc/bacula/storagedefs/file.conf` on 
Bob, so the director there can find them and orchestrate the file transfer. 


Storage {
  Name = sdef-bob
# Do not use "localhost" here
# Note: Use a fully qualified name here
  Address = bob.example.com   
  SDPort = 9103
  Password ="sd-bob-dir-bob-pw"
  Device = bob-storage
  Media Type = File
  Maximum Concurrent Jobs = 63
}

Storage {
  Name = sdef-carl
  Address = carl.example.com
  SDPort = 9103
  Password = "sd-carl-dir-bob-pw"
  Device = carl-storage
  Media Type = File
  Maximum Concurrent Jobs = 63


  TLS Enable = yes
  TLS Require = yes
  TLS Certificate = /etc/bacula/certs/carl.example.com.crt
  TLS Key = /etc/bacula/certs/carl.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
}


The original backup and the copy job involve the file daemons. The 
configuration for Bob's file daemon is at `bob:/etc/bacula/bacula-fd.conf`:

 
#
# List Directors who are permitted to contact this File daemon
#
Director {
  Name = bacula01-dir
  Password = "fd-bob-dir-bob-pw"
  Address = bob.example.com
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Enable = yes
  TLS Require = yes
#  TLS Allowed CN = "bob.example.com"
}

#
# Restricted Director, used by tray-monitor to get the
#   status of the file daemon
#
Director {
  Name = bacula01-mon
  Password = ""
  Monitor = yes
}

#
# "Global" File daemon configuration specifications
#
FileDaemon {                          # this is me
  Name = bob-fd
  FDport = 9102                  # where we listen for the director
  WorkingDirectory = /opt/bacula/working
  Pid Directory = /var/run
  Maximum Concurrent Jobs = 20
  FDAddress = bob.example.com

  TLS Enable = yes
  TLS Require = yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
}


And Alice's FileDaemon is configured at `alice:/etc/bacula/bacula-fd.conf`


#
# List Directors who are permitted to contact this File daemon
#
Director {
  Name = bob-dir
  Password = "alice-fd-bob-dir-pw"  # same password as in Bob's /etc/bacula/clientdefs/HOSTNAME-fd.conf
  TLS Enable = yes
  TLS Require = yes
  TLS Verify Peer = yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/alice.example.com.crt
  TLS Key = /etc/bacula/certs/alice.example.com.key
}

#
# Restricted Director, used by tray-monitor to get the
#   status of the file daemon
#
Director {
  Name = bacula01-mon
  Password = "RandomSecretDataHere"
  Monitor = yes
}

#
# "Global" File daemon configuration specifications
#
FileDaemon {                          # this is me
  Name = "alice-fd"
  FDport = 9102                  # where we listen for the director
  WorkingDirectory = /opt/bacula/working
  Pid Directory = /var/run
  Maximum Concurrent Jobs = 20
# Plugin Directory = /usr/lib
  FDAddress = alice.example.com

  TLS Enable = yes
  TLS Require = yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key

}

# Send all messages except skipped files back to Director
Messages {
  Name = Standard
  director = bob-dir = all, !skipped, !restored
}



The director knows about Alice's file daemon because of what's in 
`bob:/etc/bacula/clientdefs/alice.example.com-fd.conf`: 


Client {
  Name = "alice-fd"
  Address = alice.example.com
  FDPort = 9102
  Catalog = MyCatalog
  Password = "alice-fd-bob-dir-pw"
  File Retention = 30 days
  Job Retention = 6 months
  AutoPrune = yes
  TLS Enable = yes
  TLS Require = yes
  TLS Certificate = /etc/bacula/certs/alice.example.com.crt
  TLS Key = /etc/bacula/certs/alice.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
}


Similarly for its own file daemon at 
`/etc/bacula/clientdefs/bob.example.com-fd.conf`: 


Client {
  Name = "bob-fd"
  Address = bob.example.com
  FDPort = 9102
  Catalog = MyCatalog
  Password = "bob-fd-bob-dir-pw"
  File Retention = 14 days
  Job Retention = 1 months
  AutoPrune = yes
  TLS Enable = yes
  TLS Require = yes
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
}


**Problem Symptoms**

When running this type of copy job, the job never completes.

Running `status dir` within `bconsole` produces output akin to: 


 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
 63642  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63643  Copy Full          0         0  alice-copy-inc is running
 63644  Back Full          0         0  alice-job is running
 63645  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63646  Back Full          0         0  alice-job is waiting on Storage "sdef-carl"
 63647  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63648  Back Full          0         0  alice-job is waiting on Storage "sdef-carl"
 63649  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63650  Back Full          0         0  alice-job is waiting on Storage "sdef-carl"
 63651  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63652  Back Full          0         0  alice-job is waiting on Storage "sdef-carl"
 63653  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63654  Back Full          0         0  alice-job is waiting on Storage "sdef-carl"
 63655  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63656  Back Full          0         0  alice-job is waiting on Storage "sdef-carl"
 63657  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63658  Back Full          0         0  alice-job is waiting on Storage "sdef-carl"
 63659  Copy Full          0         0  alice-copy-inc is waiting on Storage "sdef-bob"
 63660  Back Full          0         0  alice-job is waiting on Storage "sdef-carl"

The number of lines depends on the day of the month: two for every incremental 
backup, plus one (`2n + 1`). The exception is the first of the month, when the 
last backup is a Full backup; in that case it should give 7 lines. Once the 
first copy job has run and all the data has been transferred, each subsequent 
run should only ever produce 3 lines. 

Since everything is waiting on these storages, running `status storage` for, 
say, Carl's storage outputs: 


carl-sd Version: 9.6.7 (10 December 2020) x86_64-pc-linux-gnu debian bullseye/sid
Daemon started 21-Jun-22 12:08. Jobs: run=0, running=2.
 Heap: heap=401,408 smbytes=722,480 max_bytes=722,674 bufs=674 max_bufs=676
 Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8 mode=0,0 newbsr=0
 Res: ndevices=2 nautochgr=0

Running Jobs:
Writing: Full Backup job alice-job JobId=63555 Volume=""
    pool="alice-Inc-Pool-carl" device="carl-storage" (/zfs1/external/bob/backups)
    spooling=0 despooling=0 despool_wait=0
    Files=0 Bytes=0 AveBytes/sec=0 LastBytes/sec=0
    FDReadSeqNo=5 in_msg=5 out_msg=4 fd=6
Writing: Full Backup job alice-job JobId=63644 Volume=""
    pool="alice-Inc-Pool-carl" device="carl-storage" (/zfs1/external/bob/backups)
    spooling=0 despooling=0 despool_wait=0
    Files=0 Bytes=0 AveBytes/sec=0 LastBytes/sec=0
    FDReadSeqNo=5 in_msg=5 out_msg=4 fd=15
====

Jobs waiting to reserve a drive:
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name
===================================================================
 63350  Full          0         0   Created  20-Jun-22 14:21 alice-job
 63352  Full          0         0   Created  20-Jun-22 14:21 alice-job
 63354  Full          0         0   Created  20-Jun-22 14:21 alice-job
 63355  Full          0         0   Created  20-Jun-22 14:21 alice-job
 63358  Full          0         0   Cancel   20-Jun-22 14:37 alice-job
 63360  Full          0         0   Cancel   20-Jun-22 14:55 alice-job
 63384  Full          0         0   Cancel   20-Jun-22 15:27 alice-job
 63459  Full          0         0   Cancel   21-Jun-22 09:57 alice-job
 63483  Full          0         0   Cancel   21-Jun-22 09:58 alice-job
 63396  Full          0         0   Cancel   21-Jun-22 12:08 alice-job
====

Device status:

Device File: "carl-storage" (/zfs1/external/bob/backups) is not open.
   Device is BLOCKED waiting to create a volume for:
       Pool:        alice-Inc-Pool-carl
       Media type:  File
   Available Space=28.15 TB
==
====

Used Volume status:
====

====


And no: manually creating a volume (which is supposed to be created 
automatically) does not make the problem go away. 

It's possible to stop the director and run it in debug mode, redirecting its 
output: executing `bacula-dir -d 201 -f > /var/log/bacula/run_debug_2.log 2>&1 &` 
shows much more information, with references to the C source.

Except... that results in hundreds of megabytes of logs, which won't fit on 
this page. I'd like to know what to look for. 
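To narrow the dump down, one option is to filter for the reservation counters first. A sketch; the grep patterns are guesses at the relevant debug messages, based on the `Dmsg1(200, "Inc rncj=%d\n", ...)` call in the source:


# summarize_reservation_debug LOGFILE: pull only the storage-reservation
# lines out of a large 'bacula-dir -d 201' log, keeping the last 200.
summarize_reservation_debug() {
  grep -E 'rncj|rstore|reserve' "$1" | tail -n 200
}

# Example:
# summarize_reservation_debug /var/log/bacula/run_debug_2.log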

One thing I found earlier is that at some point the code fails a check 
involving the counter `rncj`. I looked at the C file in question, and it 
contains the following snippet: 


bool inc_read_store(JCR *jcr)
{
   P(rstore_mutex);
   int num = jcr->rstore->getNumConcurrentJobs();
   int numread = jcr->rstore->getNumConcurrentReadJobs();
   int maxread = jcr->rstore->MaxConcurrentReadJobs;
   if (num < jcr->rstore->MaxConcurrentJobs &&
       (jcr->getJobType() == JT_RESTORE ||
        numread == 0     ||
        maxread == 0     ||     /* No limit set */
        numread < maxread))     /* Below the limit */
   {
      num++;
      numread++;
      jcr->rstore->setNumConcurrentReadJobs(numread);
      jcr->rstore->setNumConcurrentJobs(num);
      Dmsg1(200, "Inc rncj=%d\n", num);
      V(rstore_mutex);
      return true;
   }
   V(rstore_mutex);
   return false;
}



The `Inc rncj` line would be missing, while the line before it (in the job 
scheduling process) would be printed, so the job isn't allowed to proceed 
because it couldn't reserve the read storage. I tried increasing the 
MaxConcurrentJobs value, but the director just keeps cycling and never actually 
starts the job, and `rncj` seems to increase without bound (even though there 
aren't even 63 jobs queued in total). 

  [1]: https://www.bacula.org/11.0.x-manuals/en/main/Migration_Copy.html

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
