Hi,

we are running Bareos with disk-based storage (maybe adding tapes later for 
long-term archives). It currently hosts ~220 million files with roughly 
900 TB of data in about 150 jobs. Job sizes range from a few files/bytes to 
several hundred TB. We are using a standard, textbook always-incremental 
scheme, with an extra long retention time for the initial full backup.

Example:

Job {
  Name = "volume-adm"
  Description = "Backup of volume adm"
  Accurate = true
  Allow Duplicate Jobs = false
  Always Incremental = true
  Always Incremental Job Retention = 6 months
  Always Incremental Keep Number = 60
  Always Incremental Max Full Age = 5 years
  Catalog = "bcf"
  Client = "volume-backup"
  File Set = "volume-adm"
  Full Backup Pool = "AI-Consolidated"
  Maximum Concurrent Jobs = 16
  Messages = "Standard"
  Pool = "AI-Incremental"
  Priority = 10
  Schedule = "Nightly"
  Storage = "CEPH-Backup"
  Type = backup
  Write Bootstrap = "/var/lib/bareos/%c.bsr"
}

This works fine, but the virtual full jobs triggered by the consolidation 
job need to be optimized.

In the current implementation (correct me if I'm wrong), the virtual full 
job reads all data from the jobs to be consolidated and stores it in the 
full pool. With the configuration above, this is fine for the first run of 
a virtual full job: it will process two incremental runs and store 
their content (minus overwritten / deleted files) in the 
'AI-Consolidated' pool. On the next run, it will read the data from the 
previous virtual full run (in pool 'AI-Consolidated') plus the data from the next 
incremental run (in pool 'AI-Incremental') and store it in the 
'AI-Consolidated' pool. So data that is already stored in the correct pool is read 
and written again. This is fine for tape-based backups, but with disk-based 
backups the data is copied unnecessarily. For large jobs (think 100-200 TB) 
it might even become infeasible, since a virtual full run will take days or weeks.
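
A back-of-envelope illustration (plain Python, nothing Bareos-specific; the 
sizes are made-up examples, not taken from our setup) of why this hurts on disk:

# I/O caused by one consolidation run under the current scheme.
full_tb = 150.0        # previous virtual full, already sitting in AI-Consolidated
incremental_tb = 0.5   # oldest incremental falling out of the retention window

# The virtual full job reads the previous full plus the incremental(s)
# and writes everything back into AI-Consolidated:
read_tb = full_tb + incremental_tb
written_tb = full_tb + incremental_tb
print(f"read {read_tb} TB, wrote {written_tb} TB, "
      f"to consolidate only {incremental_tb} TB of new data")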

I don't know the details of the volume header format, but it should be 
possible to implement the following method (a rough sketch follows the list):

for each file:
1. if source and target pool are different, use the standard copy method
2. if source and target pool are not disk based, use the standard copy method
3. otherwise, update the header in the existing volume(s) to reflect the 
changes (e.g. a different job ID)
4. update the database to reflect the changes
5. in case of pruned files, truncate the corresponding chunk in the volume(s)
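
A rough, self-contained sketch of that decision logic in Python. The 
Record/Pool types and the action names are invented for illustration only 
and do not correspond to any existing Bareos internals:

from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    disk_based: bool

@dataclass
class Record:
    volume: str      # volume file the record lives in
    offset: int      # byte offset of the chunk inside the volume
    length: int      # chunk length in bytes
    pool: Pool       # pool the volume belongs to
    pruned: bool     # file no longer referenced by any kept job

def consolidate_record(rec: Record, target: Pool, new_job_id: int) -> str:
    # steps 1 + 2: fall back to the normal read/rewrite copy
    if rec.pool.name != target.name or not (rec.pool.disk_based and target.disk_based):
        return "copy"        # standard virtual full behaviour
    # step 5: a pruned file only needs its chunk released in the volume
    if rec.pruned:
        return "punch-hole"
    # steps 3 + 4: data stays where it is, only volume header and catalog change
    return "relabel"

# example: a record that is already in the consolidated pool on disk
pool = Pool("AI-Consolidated", disk_based=True)
print(consolidate_record(Record("AI-Consolidated-0001", 4096, 65536, pool, False), pool, 4711))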

It might be tricky to ensure atomicity of steps 3 + 4 to avoid 
inconsistencies. Most filesystems should be able to handle sparse files 
correctly, so an extra "defragmentation" step seems unnecessary.
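
For step 5, hole punching should be enough on Linux; the filesystem then 
frees the blocks without rewriting the rest of the volume. A minimal sketch 
(the volume path, offset and length are made-up examples, and this is of 
course not how the storage daemon would actually be wired up):

import ctypes
import os

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def punch_hole(path: str, offset: int, length: int) -> None:
    """Release the blocks of a pruned chunk inside an existing volume file."""
    fd = os.open(path, os.O_RDWR)
    try:
        ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             ctypes.c_long(offset), ctypes.c_long(length))
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

# punch_hole("/srv/bareos/storage/AI-Consolidated-0001", 1 << 20, 64 << 10)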

This handling of virtual full jobs needs to be configurable, since it is 
suitable for certain workloads only (few changed files, many new files added 
per job).

Any comments on this? Are there any obvious showstoppers?

Best regards,
Burkhard Linke
