Hi,

we are running Bareos with disk-based storage (maybe adding tapes later for 
long-term archives). It currently hosts ~220 million files with roughly 
900 TB of data in about 150 jobs. Job sizes range from a few files/bytes to 
several hundred TB. We are using a standard, textbook always-incremental 
scheme, with an extra long retention time for the initial full backup.

Example:

Job {
  Name = "volume-adm"
  Description = "Backup of volume adm"
  Accurate = true
  Allow Duplicate Jobs = false
  Always Incremental = true
  Always Incremental Job Retention = 6 months
  Always Incremental Keep Number = 60
  Always Incremental Max Full Age = 5 years
  Catalog = "bcf"
  Client = "volume-backup"
  File Set = "volume-adm"
  Full Backup Pool = "AI-Consolidated"
  Maximum Concurrent Jobs = 16
  Messages = "Standard"
  Pool = "AI-Incremental"
  Priority = 10
  Schedule = "Nightly"
  Storage = "CEPH-Backup"
  Type = backup
  Write Bootstrap = "/var/lib/bareos/%c.bsr"
}

This works fine, but the virtual full jobs triggered by the consolidation 
job need to be optimized.

In the current implementation (correct me if I'm wrong), the virtual full 
job reads all data from the jobs to be consolidated and stores it in the 
full pool. With the configuration above, this is fine for the first run of 
a virtual full job: it will process two incremental runs and store 
their content (minus overwritten / deleted files) in the 
'AI-Consolidated' pool. On the next run, it will read the data from the 
previous virtual full run (in pool 'AI-Consolidated') plus the data from the next 
incremental run (in pool 'AI-Incremental') and store it in the 
'AI-Consolidated' pool. So data that is already stored in the correct pool is read 
and written again. This is fine for tape-based backups, but with disk-based 
backups the data is copied unnecessarily. For large jobs (think 100-200 TB) 
it might even become infeasible, since a virtual full run will take days or weeks.
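
A back-of-envelope illustration (plain Python, nothing Bareos-specific; the 
sizes are made-up examples, not taken from our setup) of why this hurts on disk:

# I/O caused by one consolidation run under the current scheme.
full_tb = 150.0        # previous virtual full, already sitting in AI-Consolidated
incremental_tb = 0.5   # oldest incremental falling out of the retention window

# The virtual full job reads the previous full plus the incremental(s)
# and writes everything back into AI-Consolidated:
read_tb = full_tb + incremental_tb
written_tb = full_tb + incremental_tb
print(f"read {read_tb} TB, wrote {written_tb} TB, "
      f"to consolidate only {incremental_tb} TB of new data")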

I don't know the details of the volume header format, but it should be 
possible to implement the following method (a rough sketch follows the list):

for each file:
1. if source and target pool are different, use the standard copy method
2. if source and target pool are not disk based, use the standard copy method
3. otherwise, update the header in the existing volume(s) to reflect the 
changes (e.g. a different job ID)
4. update the database to reflect the changes
5. in case of pruned files, truncate the corresponding chunk in the volume(s)
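
A rough, self-contained sketch of that decision logic in Python. The 
Record/Pool types and the action names are invented for illustration only 
and do not correspond to any existing Bareos internals:

from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    disk_based: bool

@dataclass
class Record:
    volume: str      # volume file the record lives in
    offset: int      # byte offset of the chunk inside the volume
    length: int      # chunk length in bytes
    pool: Pool       # pool the volume belongs to
    pruned: bool     # file no longer referenced by any kept job

def consolidate_record(rec: Record, target: Pool, new_job_id: int) -> str:
    # steps 1 + 2: fall back to the normal read/rewrite copy
    if rec.pool.name != target.name or not (rec.pool.disk_based and target.disk_based):
        return "copy"        # standard virtual full behaviour
    # step 5: a pruned file only needs its chunk released in the volume
    if rec.pruned:
        return "punch-hole"
    # steps 3 + 4: data stays where it is, only volume header and catalog change
    return "relabel"

# example: a record that is already in the consolidated pool on disk
pool = Pool("AI-Consolidated", disk_based=True)
print(consolidate_record(Record("AI-Consolidated-0001", 4096, 65536, pool, False), pool, 4711))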

It might be tricky to ensure atomicity of steps 3 + 4 to avoid 
inconsistencies. Most filesystems should be able to handle sparse files 
correctly, so an extra "defragmentation" step seems unnecessary.
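
For step 5, hole punching should be enough on Linux; the filesystem then 
frees the blocks without rewriting the rest of the volume. A minimal sketch 
(the volume path, offset and length are made-up examples, and this is of 
course not how the storage daemon would actually be wired up):

import ctypes
import os

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def punch_hole(path: str, offset: int, length: int) -> None:
    """Release the blocks of a pruned chunk inside an existing volume file."""
    fd = os.open(path, os.O_RDWR)
    try:
        ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             ctypes.c_long(offset), ctypes.c_long(length))
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

# punch_hole("/srv/bareos/storage/AI-Consolidated-0001", 1 << 20, 64 << 10)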

This handling of virtual full jobs needs to be configurable, since it is 
suitable for certain workloads only (few changed files, many new files added 
per job).

Any comments on this? Are there any obvious showstoppers?

Best regards,
Burkhard Linke
