Hello,

On Thu, 3 Mar 2022 at 12:09, egoitz--- via Bacula-devel <
bacula-devel@lists.sourceforge.net> wrote:

> Good morning,
>
>
> I know Bacula enterprise provides deduplication plugins, but sadly we
> can't afford it. No problem, we will try to create an open source
> deduplication plugin for bacula file daemon. I would use rdiff (part of
> librsync) for delta patching and signature generation.
>
What signatures does rdiff use?

> I would love to create a Bacula plugin for deduplicating content at fd
> level. This way, even if the backup is sent crypted by fd to sd, the
> deduplication could be done obtaining the best results as the deduplication
> takes place when the files are not crypted yet.
>
Yes. With proper encryption the same data block always produces different
ciphertext, which makes deduplication of encrypted data totally useless. :)
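
To illustrate the point, here is a minimal sketch. The "cipher" is a toy I
made up for this example (a random nonce plus a SHAKE-256 keystream), not
what Bacula uses and not something to use for real data. It only shows that
identical plaintext blocks fingerprint identically, while their encrypted
forms do not.

```python
import hashlib
import os


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy stream cipher, for illustration only (hypothetical helper)."""
    nonce = os.urandom(16)  # fresh random nonce on every call
    keystream = hashlib.shake_256(key + nonce).digest(len(plaintext))
    body = bytes(p ^ k for p, k in zip(plaintext, keystream))
    return nonce + body


def fingerprint(block: bytes) -> str:
    """Fingerprint a block the way a deduplication index would."""
    return hashlib.sha256(block).hexdigest()


key = b"example key"
block = b"the same data block" * 100

# The same block seen twice, unencrypted: one index entry.
plain_index = {fingerprint(block), fingerprint(block)}

# The same block seen twice, encrypted first: two index entries.
cipher_index = {
    fingerprint(toy_encrypt(key, block)),
    fingerprint(toy_encrypt(key, block)),
}

print(len(plain_index), len(cipher_index))
```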

> The deduplication, would only be applied to files, let's say larger than
> 10GB.
>
???

I designed Bacula deduplication to handle blocks (files) larger than 1k,
because the indexing overhead for smaller blocks was too high. The larger
the block you use, the lower the chance of getting a good deduplication
ratio. So it is a trade-off: small blocks == good deduplication ratio but
higher indexing overhead; larger blocks == weak deduplication ratio but
lower indexing overhead. So it handled block sizes from 1K up to 64k (the
default Bacula block size, but it could be extended to any size).
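
The trade-off above can be sketched roughly as follows. This is only an
illustration with fixed-size blocks and an in-memory set as the index; the
function names and the synthetic test data are my assumptions for the
example, not how the Bacula deduplication code is written.

```python
import hashlib

KIB = 1024


def dedup_stats(data: bytes, block_size: int) -> tuple[int, int]:
    """Return (total_blocks, unique_blocks) for fixed-size blocks."""
    index = set()
    total = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        index.add(hashlib.sha256(block).digest())
        total += 1
    return total, len(index)


def make_segment(shift: int) -> bytes:
    """Build a 64 KiB segment out of 16 distinct 1 KiB chunks (synthetic data)."""
    return b"".join(bytes([(j + shift) % 16]) * KIB for j in range(64))


# 256 KiB of data: the same 16 chunks everywhere, but each 64 KiB segment
# starts at a different chunk, so no two 64 KiB blocks are identical.
data = b"".join(make_segment(s) for s in range(4))

for size in (1 * KIB, 64 * KIB):
    total, unique = dedup_stats(data, size)
    stored_kib = unique * size // KIB
    print(f"{size // KIB:>2}K blocks: {total} lookups, "
          f"{unique} index entries, {stored_kib} KiB stored")
```

With this data, 1K blocks need 256 index lookups and 16 index entries but
store only 16 KiB, while 64K blocks need just 4 lookups and 4 entries but
store all 256 KiB.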

> If you don't mind, I would like to share with you my ideas, in order to at
> least know, "this all" is a possible way.
>
>
> My idea is basically :
>
>
> *- When doing a backup :*
>
>
> ++ Check the backup level we are running. I suppose that asking bVarLevel
> to getBaculaValue()
>
Deduplication should be totally transparent to the backup level. You want
to deduplicate data especially for the largest, full level backups, right?

> ++ In startBackupFile() I suppose it gives me file size info (or if at
> least gives me the name and I'll do an stat() in some manner), get the file
> size.
>
No. The standard "Bacula Command Plugin API" expects the plugin itself to
return the stat info of the file to back up.

> +++If it's a full level and bigger than 10GB, obtain the file signature
> and finally store that new (previously non existing) signature (written in
> a file with a known nomenclature based on ORIGINAL_FILE's name), plus the
> whole ORIGINAL_FILE (the one we have generated the signature from) in
> Bacula tapes. Should I need to say to Bacula, to re-read the directory for
> being able to backup generated file signatures?. They weren't until know we
> have generated a file that contains ORIGINAL_FILE signature.
>
Why do you call it a "deduplication plugin"? What you describe above is the
functionality of the Delta plugin, which supports so-called "block level
incremental" backup. That is _NOT_ deduplication. "Block level incremental"
backs up only the blocks inside a single file which changed between backups.
It does not deduplicate the backup stream in any sense. For two identical
files which change in the same way, the Delta plugin will back up the data
twice, leaving the duplication in place.
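
A minimal sketch of the difference, assuming fixed-offset block comparison
(a simplification; rdiff/librsync matches blocks with signatures and is
smarter than this). The helper names are made up for the example.

```python
import hashlib

BLOCK = 4096


def split(data: bytes, size: int = BLOCK) -> list[bytes]:
    """Cut data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def changed_blocks(old: bytes, new: bytes) -> list[bytes]:
    """Block level incremental: blocks of `new` that differ from `old`
    at the same position (or lie beyond the end of `old`)."""
    old_blocks = split(old)
    return [
        block
        for i, block in enumerate(split(new))
        if i >= len(old_blocks) or block != old_blocks[i]
    ]


old = b"A" * BLOCK + b"B" * BLOCK + b"C" * BLOCK
new = b"A" * BLOCK + b"X" * BLOCK + b"C" * BLOCK

# Two identical files that change in the same way.
delta_file1 = changed_blocks(old, new)
delta_file2 = changed_blocks(old, new)

# Block level incremental stores the changed block once per file.
delta_stored = len(delta_file1) + len(delta_file2)

# Deduplication stores each distinct block once for the whole stream.
dedup_stored = len({hashlib.sha256(b).digest()
                    for b in delta_file1 + delta_file2})

print(delta_stored, dedup_stored)
```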

The Delta plugin uses exactly the procedure and library you describe above,
and for that kind of plugin you should use the "Options Plugin API".

> +++If it's an inc level and a previous signature of ORIGINAL_FILE file
> exists (I would know because they will have a known nomenclature based on
> ORIGINAL_FILE's name), with the previous signature plus the new state of
> the file (the new file state I mean), create a patch. Later obtain again,
> the file signature in the new status. Finally store that new signature plus
> the patch in Bacula tapes. Finally return a bRC_Skip of the ORIGINAL_FILE
> (because we are going to copy a delta patch and a signature). If I return a
> bRC_Skip to here... would the fd, skip this file, but see the signatures
> and delta patches generated before returning the bRC_Skip?. Or should I ask
> to fd, in some manner, to re-read the directory?.
>
It sounds like an exact step by step description of the Delta plugin.

So, now I understand why you want to handle files > 10G only. :)

>
> As you would assume in the incremental backups, I'm not storing the
> filename as its in the filesystem. It should more or less the following way
> :
>
>
> In a full level backup :
>
> ++ BEFORE THE BACKUP  :
>
> *BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT*
>
> ORIGINAL_FILE            <--->
>
> ++ AFTER THE BACKUP :
>
> *BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT*
>
> ORIGINAL_FILE + SIGNATURE FILE           <--->  ORIGINAL FILE + SIGNATURE
> FILE
>
>
> In the next incremental level backup :
>
> ++ BEFORE THE BACKUP  :
>
> *BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT*
>
> NEW_STATE_ORIGINAL_FILE + SIGNATURE FILE GENERATED THE LAST FULL DAY
> <--->  *FROM THE FULL BACKUP*(ORIGINAL FILE + SIGNATURE FILE)
>
> ++ AFTER THE BACKUP :
>
> *BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT*
>
> NEW_STATE_ORIGINAL_FILE + SIGNATURE FILE OF NEW_STATE_ORIGINAL_FILE <--->
> *FROM THE FULL BACKUP*(ORIGINAL FILE + SIGNATURE FILE) + PATCH FILE +
> SIGNATURE FILE OF NEW_STATE_ORIGINAL_FILE
>
>
> *- When restoring a backup :*
>
> If the restored files nomenclature is  (for example...)
> ORIGINAL_FILE-SIGNATURE- OR ORIGINAL_FILE-PATCH that would mean (I assume I
> could see in the filename to be restored in startRestoreFile() because it
> has accessible the filename), we have backed up deltas of ORIGINAL_FILE in
> the incremental backups.
>
> So, let's write to a plain text file with this path inside it, in order
> for later, in a post restore job (or even bEventEndBackupJob event of the
> api?), to apply the patches in that path, to the ORIGINAL_FILE obtained
> from the own name of the patch files. Finally after patching job done,
> remove signature files and patch files. Obviously leaving the last status
> of ORIGINAL_FILE at the restored date.
>
>
> So, at this point, I would be very very thankful :) :) :) if some
> experienced developer, could give me some idea or if can see something is
> wrong or should achieved in some other manner or with other plugin api
> functions.....
>
IMVHO, a Delta plugin is best handled with the "Options Plugin API" (as the
current Delta plugin is) and not the "Command Plugin API", since most of the
backup functionality will then be provided by Bacula itself.

best regards

BTW, I think the Delta plugin available in BEE is fairly cheap compared to
the full deduplication options.
-- 
Radosław Korzeniewski
rados...@korzeniewski.net
_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
