Hi,

the volume content is user controlled, so splitting it up into multiple jobs automatically might not solve the problem (and would result in far more jobs...). Doing it manually is not feasible due to limited manpower.
I think the best solution is extending the AI scheme, similar to the data handling in librrd for monitoring data:

- create an initial full (already done)
- run nightly incrementals (also done)
- define "stages" for incrementals, e.g.
  - monthly
  - half year
  - year
- merge all incrementals based on stages
- merge the highest stage with the full

It somewhat resembles a full-differential-incremental scheme, but with merged incrementals instead of real differential runs. An example would be keeping daily incrementals for the last two months, weekly merged incrementals for the following 4 months, and monthly merged incrementals for the 6 months after that. This would cover a whole year and result in 1 + 6 + 4*4 + 2*30 = 83 backup datasets per job, assuming 4 weeks and 30 days per month. The virtual full job merging the full and the oldest incremental set can be scheduled to run at times of the year with low activity.

We would implement this scheme as an external script and disable the standard consolidate job. Does the Python binding allow specifying which jobs should be merged in a virtual full job, similar to what the code in the consolidate job does? (Rough sketches of the stage bookkeeping and of the python-bareos call we have in mind are appended below the quoted mail.)

Best regards,
Burkhard

andreas.rogge wrote on Thursday, 3 November 2022 at 21:40:41 UTC+1:
> On 02.11.22 at 16:00, Burkhard Linke wrote:
> > Hi,
> >
> > we are running bareos with disk based storage (maybe adding tapes later for long term archives). It currently hosts ~220 million files with roughly 900 TB of data in about 150 jobs. Job sizes range from a few files/bytes to several hundred TB. We are using a standard textbook always incremental scheme, with an extra long retention time for the initial full backup.
> >
> > Example:
> >
> > Job {
> ...
> >   Always Incremental = true
> >   Always Incremental Job Retention = 6 months
> >   Always Incremental Keep Number = 60
> >   Always Incremental Max Full Age = 5 years
> ...
> > }
> >
> > This works fine, but the virtual full jobs triggered by the consolidation job need to be optimized.
>
> When I reverse engineer your settings, it means:
> - I want to keep every Incremental that was made in the past 6 months
> - I want to keep at least 60 of these Incrementals
> - I want to keep a full backup around that isn't older than 5 years
>
> Assuming that you're doing daily backups, you'll end up with:
> - 6 * 30 = 180 Incrementals for the last 180 days
> - 1 Full that is on average 2.75 years old
>
> As long as your Full isn't older than 5 years, consolidation will take the oldest Incremental and merge it with the second oldest Incremental into a new Incremental (which is then considered the oldest one).
>
> When your Full is 5 years old, consolidation will take the Full, the oldest Incremental and the second oldest Incremental and merge them into a new Full.
>
> So in your setup, the oldest Incremental will grow for 4.5 years until it gets merged into your Full. During that period it will get bigger and bigger, which will make the consolidation take longer and longer.
>
> Long story short: you can probably save a lot of time moving data around if you decrease AI Max Full Age to maybe 9 months or so, effectively producing a new Full every 3 months and keeping the daily consolidation a lot smaller.
>
> > In the current implementation (correct me if I'm wrong), the virtual full job reads all data from the jobs to be consolidated and stores them in the full pool. With the configuration above, this is fine for the first run of a virtual full job.
> > It will process two incremental runs and store their content (minus overwritten / deleted files) in the 'AI-Consolidated' pool. On the next run, it will read the data from the previous virtual full run (in pool 'AI-Consolidated') plus the data from the next incremental run (in pool 'AI-Incremental') and store it in the 'AI-Consolidated' pool. So data already stored in the correct pool is read and written again. This is fine for tape based backups, but in the case of disk based backups data is copied unnecessarily. For large jobs (think 100-200 TB) it might even become unfeasible, since a virtual full run will take days/weeks.
>
> I agree that AI (and Virtual Full in general) is pretty i/o heavy for the SD. However, that's simply how it works right now.
> If you need to consolidate large jobs (as you said, 100-200 TB), you'll need unreasonably fast storage (i.e. a lot more than 1 GB/s) to finish within a day.
> The only workaround is to cut these jobs into pieces and configure Max Full Consolidations to spread the consolidation into a new full backup across several days.
>
> > I don't know the details of the volume header format, but it should be possible to implement the following method:
> >
> > for each file:
> > 1. if source and target pool are different, use standard copy method
> > 2. if source and target pool are not disk based, use standard copy method
> > 3. update header in existing volume(s) to reflect changes (e.g. different job id)
> > 4. update database to reflect changes
> > 5. in case of pruned files, truncate corresponding chunk in the volume(s)
> >
> > It might be tricky to ensure atomicity of steps 3 + 4 to avoid inconsistencies. Most filesystems should be able to handle sparse files correctly, so an extra "defragmentation" step seems to be unnecessary.
> >
> > Any comments on this? Are there any obvious showstoppers?
>
> To make that work we would need to:
> 1. Add a new job to the catalog that references the ranges of the jobs to be consolidated
> 2. Change all the block headers in existing volumes so that they belong to the consolidated job
> 3. Change all the record headers so they have file ids that are strictly increasing (from the new job's point of view)
> 4. Mark records that are no longer needed in some way
> 5. Rewrite the first SOS record, remove all other SOS and all but the last EOS record and overwrite the last EOS record.
> 6. Remove all blocks that consist only of records that are no longer needed (and make sure the SD and all tools can read volumes with nulled blocks in them)
> 7. Remove the original jobs from the catalog
> 8. Provide a 100% failsafe way to resume operations 2 to 6, otherwise a crash during that operation would leave all data in the job unreadable.
>
> Sounds like quite an agenda. With the current on-disk format, I wouldn't dare trying it. There's just too much that can go wrong in the process.
>
> We're planning to introduce another file-based storage backend with a different on-disk format next year. That would theoretically allow doing virtual full backups with zero-copy for the payload (i.e. it would still read and write the block and record headers, but wouldn't copy the payload anymore).
> However, that backend is still vaporware today and zero-copy has not even made it to the agenda yet.
>
> Best Regards,
> Andreas
>
> --
> Andreas Rogge andrea...@bareos.com
> Bareos GmbH & Co. KG
> Phone: +49 221-630693-86
> http://www.bareos.com
>
> Registered office (Sitz der Gesellschaft): Köln | Amtsgericht Köln: HRA 29646
> General partner (Komplementär): Bareos Verwaltungs-GmbH
> Managing directors (Geschäftsführer): S. Dühr, M. Außendorf, J. Steffens, Philipp Storz
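
As mentioned above, here is a minimal sketch of the stage bookkeeping such an external script could do. The stage names, boundaries and the helper function are purely illustrative and just mirror the example layout described at the top of this mail:

```python
from datetime import timedelta

# Example stage layout from above: daily incrementals for the last 2 months,
# weekly merged incrementals for the following 4 months, monthly merged
# incrementals for the 6 months after that (assuming 30 days per month).
STAGES = [
    ("daily",   timedelta(days=0),   timedelta(days=60)),
    ("weekly",  timedelta(days=60),  timedelta(days=180)),
    ("monthly", timedelta(days=180), timedelta(days=360)),
]

def stage_of(age, stages=STAGES):
    """Return the stage an incremental of the given age belongs to."""
    for name, start, end in stages:
        if start <= age < end:
            return name
    return "merge-into-full"  # older than the last stage: candidate for the virtual full

# Resulting dataset count per job for this layout:
# 1 full + 6 monthly + 4*4 weekly + 2*30 daily = 83
print(1 + 6 + 4 * 4 + 2 * 30)
```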
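And this is roughly what we would like the script to do via the python-bareos console binding. The connection parameters are placeholders, the filtering is only sketched, and the last step is exactly the open question above, i.e. we do not know whether an explicit jobid list can be passed to a virtual full run:

```python
import bareos.bsock

# Placeholder connection to a local director.
password = bareos.bsock.Password("secret")
director = bareos.bsock.DirectorConsoleJson(
    address="localhost", port=9101, password=password
)

# Fetch the jobs known to the catalog; grouping them into stages (see the
# sketch above) would then happen in Python. We assume the JSON result
# contains a "jobs" list with "jobid", "level" and "starttime" fields.
result = director.call("list jobs")
incrementals = [j for j in result.get("jobs", []) if j.get("level") == "I"]

# The open question: can we tell the director explicitly which jobs to merge,
# e.g. something along the lines of
#   run job=backup-foo level=VirtualFull jobid=101,102,103 yes
# or is the source job selection always done internally, as in the
# consolidate job?
```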