Hi,

we are running bareos with disk based storage (maybe adding tapes later for long term archives). It currently hosts ~220 million files with roughly 900 TB of data in about 150 jobs. Job sizes range from a few files/bytes to several hundred TB. We are using a standard textbook always incremental scheme, with an extra long retention for the initial full backup.
Example:

Job {
  Name = "volume-adm"
  Description = "Backup of volume adm"
  Accurate = true
  Allow Duplicate Jobs = false
  Always Incremental = true
  Always Incremental Job Retention = 6 months
  Always Incremental Keep Number = 60
  Always Incremental Max Full Age = 5 years
  Catalog = "bcf"
  Client = "volume-backup"
  File Set = "volume-adm"
  Full Backup Pool = "AI-Consolidated"
  Maximum Concurrent Jobs = 16
  Messages = "Standard"
  Pool = "AI-Incremental"
  Priority = 10
  Schedule = "Nightly"
  Storage = "CEPH-Backup"
  Type = backup
  Write Bootstrap = "/var/lib/bareos/%c.bsr"
}

This works fine, but the virtual full jobs triggered by the consolidation job need to be optimized. In the current implementation (correct me if I'm wrong), the virtual full job reads all data from the jobs to be consolidated and stores it in the full pool. With the configuration above this is fine for the first run of a virtual full job: it processes two incremental runs and stores their content (minus overwritten / deleted files) in the 'AI-Consolidated' pool. On the next run, it reads the data from the previous virtual full run (in pool 'AI-Consolidated') plus the data from the next incremental run (in pool 'AI-Incremental') and stores the result in the 'AI-Consolidated' pool. So data that is already stored in the correct pool is read and written again. This is fine for tape based backups, but with disk based backups data is copied unnecessarily. For large jobs (think 100-200 TB) it might even become unfeasible, since a virtual full run will take days or weeks.

I don't know the details of the volume header format, but it should be possible to implement the following method for each file (a rough sketch of this per-file logic is appended below the signature):

1. If source and target pool are different, use the standard copy method.
2. If source and target pool are not disk based, use the standard copy method.
3. Otherwise, update the header in the existing volume(s) to reflect the changes (e.g. a different job id).
4. Update the database to reflect the changes.
5. In case of pruned files, truncate the corresponding chunk in the volume(s).

It might be tricky to ensure atomicity of steps 3 + 4 to avoid inconsistencies. Most filesystems should be able to handle sparse files correctly, so an extra "defragmentation" step seems to be unnecessary. This handling of virtual full jobs needs to be configurable, since it is only suitable for certain workloads (few changed files, many new files added per job).

Any comments on this? Are there any obvious showstoppers?

Best regards,
Burkhard Linke
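
For illustration, here is a rough sketch of the per-file decision logic from the list above. None of the names are Bareos APIs: FileRecord, rewrite_chunk_header, standard_copy and the catalog methods are hypothetical placeholders, and only the hole punching for pruned files maps to a real mechanism (fallocate(2) with FALLOC_FL_PUNCH_HOLE on Linux). A real implementation would of course have to follow the actual volume record format and catalog schema.

# Illustrative sketch only -- none of these names are Bareos APIs.
import ctypes
import ctypes.util
import os
from dataclasses import dataclass

# fallocate(2) flags from <linux/falloc.h> (Linux/glibc only)
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]


@dataclass
class FileRecord:
    """One file entry of a job being consolidated (hypothetical model)."""
    path: str             # file name as seen by the fileset
    volume_path: str      # disk volume file the data currently lives in
    chunk_offset: int     # byte offset of the record inside the volume
    chunk_length: int     # length of the record inside the volume
    source_pool: str      # pool the record is currently stored in
    pruned: bool = False  # overwritten/deleted file that must not survive


def punch_hole(volume_path: str, offset: int, length: int) -> None:
    """Step 5: release the chunk of a pruned file, leaving a sparse hole."""
    fd = os.open(volume_path, os.O_RDWR)
    try:
        ret = _libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              offset, length)
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)


def rewrite_chunk_header(volume_path: str, offset: int, new_job_id: int) -> None:
    """Step 3 (placeholder): patch the record header in place, e.g. the job id.
    Must follow the actual Bareos volume record format."""
    raise NotImplementedError


def standard_copy(rec: FileRecord, target_pool: str, new_job_id: int, catalog) -> None:
    """Steps 1/2 fallback (placeholder): today's copy-based virtual full path."""
    raise NotImplementedError


def consolidate_file(rec: FileRecord, target_pool: str, new_job_id: int,
                     disk_pools: set, catalog) -> None:
    """Decide per file whether its data can stay in place or must be copied."""
    # 1. + 2.: different pools, or a non-disk pool involved -> standard copy.
    if rec.source_pool != target_pool or not {rec.source_pool, target_pool} <= disk_pools:
        standard_copy(rec, target_pool, new_job_id, catalog)
        return

    if rec.pruned:
        # 5.: drop the payload; the filesystem keeps the volume sparse,
        # so no separate defragmentation pass should be needed.
        punch_hole(rec.volume_path, rec.chunk_offset, rec.chunk_length)
        catalog.delete_file_record(rec)
        return

    # 3.: rewrite only the header of the existing chunk (payload untouched).
    rewrite_chunk_header(rec.volume_path, rec.chunk_offset, new_job_id)
    # 4.: repoint the catalog entry at the new virtual full job.
    # Making 3. + 4. atomic (a journal, or crash-safe idempotent header
    # rewrites) is the tricky part mentioned above.
    catalog.reassign_file_record(rec, new_job_id)

The fallback in steps 1 + 2 keeps the current copy-based behaviour for anything that is not a disk-to-same-disk-pool consolidation, so the optimization could stay opt-in per job or per pool.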