On 22/04/2021 08:31, griffin tucker wrote:
On Thu, 22 Apr 2021 at 17:17, Dominic Raferd <domi...@timedicer.co.uk> wrote:

On 22/04/2021 08:07, Dominic Raferd wrote:
On 22/04/2021 08:01, griffin tucker wrote:
I've tried using deduplication, but I only get about 6GB of savings per 30GB.

I intend to use squashfs on top of rdiff-backup; btrfs is just being used temporarily.

On Thu, 22 Apr 2021 at 16:41, Dominic Raferd
<domi...@timedicer.co.uk> wrote:
On 22/04/2021 07:03, griffin tucker wrote:
I have a collection of the last 5 monthly dumps of various wikis from
dumps.wikimedia.org.

Each dump has numbered directories in the format 20210501, 20210401,
20210301, etc.

All the filenames in these directories remain the same with each
wiki's dump, with the exception of enwiki.

Other than enwiki, these range from about 30GB to about 370GB
uncompressed with each successive dump.

Enwiki, the main English Wikipedia, has mostly the same named files,
but its pages-meta-history.xml file is split up into various 1-55GB
compressed files (mostly 1-2GB), making a total of about 700GB
compressed (disregarding redundant files).

I'm not sure how big enwiki is uncompressed, but it could be close to
25TB. I haven't figured out how to make rdiff-backup more efficient
with these files, aside from a script to merge each metahistory file
into a single huge >100GB file before running rdiff-backup, and then
splitting it back into the separate files with an index after
restoring - roughly the sketch below.
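
Something like this (a rough, untested sketch using GNU dd; the
filenames and glob are placeholders):

    # merge the split history files into one big file, recording
    # each piece's name and size in an index
    cd enwiki/20210401
    : > merged.index
    for f in *pages-meta-history*.xml*; do
        stat -c '%n %s' "$f" >> merged.index
        cat "$f" >> pages-meta-history.merged
    done

    # after restoring, split the merged file back up using the index
    offset=0
    while read -r name size; do
        dd if=pages-meta-history.merged of="$name" \
           iflag=skip_bytes,count_bytes skip="$offset" count="$size" status=none
        offset=$((offset + size))
    done < merged.index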

I'm using btrfs with zstd:15 to store the files uncompressed; however,
I don't have enough storage to hold enwiki uncompressed - zstd
compression just isn't that good, even at maximum. I've used xz
compression, which attains much better compression ratios for the
other wikis, but that isn't exactly seamless (experiments with fuse failed).
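
(For reference, the btrfs side is just transparent compression via a
mount option - this example assumes a recent kernel, and the device and
mountpoint names are placeholders:

    # mount with zstd at its maximum level; files look uncompressed
    # to applications, btrfs compresses them transparently on disk
    mount -o compress=zstd:15 /dev/sdX /mnt/wikidumps
)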

So, to save space, I thought I would use rdiff-backup so that it only
stores the differences between dumps, and it works very well in
initial tests. However, if I run the reverse incremental backups one
after the other today, they all get dated today rather than
20210501, 20210401, etc., which isn't informative.

If I could add a comment next to each datetime stamp, that would be
useful; otherwise I'll have to keep a separate index (roughly as
sketched below), which isn't a huge problem. I just thought I'd ask
whether I could change the datetime stamps before writing such a script.
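
The separate index could be as simple as this (an untested sketch
assuming the classic rdiff-backup command line; paths are placeholders):

    # back up each monthly dump in date order, recording which
    # run corresponds to which dump date
    for d in 20210301 20210401 20210501; do
        rdiff-backup "dumps/$d" backups/wiki
        echo "$(date -Iseconds) $d" >> backups/wiki.index
    done

    # cross-reference the index against rdiff-backup's own increment list
    rdiff-backup --list-increments backups/wiki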

On Thu, 22 Apr 2021 at 15:19, Eric Lavarde <e...@lavar.de> wrote:
Hi Griffin,

On 22/04/2021 06:39, griffin tucker wrote:
Is there a way to change the timestamps of the backups?
no

Or perhaps replace the timestamps with a unique name?
no

Would this cause a faulty restore or a damaged backup?
yes, rdiff-backup makes a lot of date/time comparisons, so the
timestamp is meaningful.

What are you trying to do?

KR, Eric
Since you are already using btrfs, have you considered using
deduplication? Likely to work better if you store uncompressed.

In your scenario I would expect deduplication to give big savings if
you store uncompressed. If not, YMMV. (I tried with rdiff-backup on
btrfs + deduplication a few years ago but found it all a bit scary and
retreated to ext4.)
To clarify, I mean turning off compression within rdiff-backup and
instead using compression (+ deduplication) at the filesystem level.
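
i.e. something along these lines (a sketch; --no-compression is a
classic rdiff-backup option, and the paths are just examples):

    # disable rdiff-backup's own gzip compression of increments,
    # leaving compression and deduplication to the filesystem
    rdiff-backup --no-compression dumps/20210401 /mnt/btrfs/backups/wiki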
Well, I suppose I was using Windows Server's dedupe for that 6GB per
30GB saving; maybe I should try again with btrfs' dedupe.

Come to think of it, dedupe seems to be enabled already, which would
explain <5-second copies for hundreds of gigabytes, but I can't get
the dedupe status when I run:

btrfs dedupe status <mountpoint>

which fails with the error:

btrfs: unknown token 'dedupe'

I'll investigate this further.
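
So far it looks like there is no 'dedupe' subcommand in btrfs-progs at
all, and out-of-band dedup on btrfs is done with a separate tool such
as duperemove - for example (untested; the hashfile path and
mountpoint are placeholders):

    # scan recursively (-r) and submit dedup requests (-d);
    # the hashfile caches checksums so reruns are faster
    duperemove -rd --hashfile=/var/tmp/dupehash.db /mnt/wikidumps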
Another option is to use ZFS; Patrik wrote about it here: https://www.ikus-soft.com/en/blog/2020-07-22-configure-zfs-for-rdiff-backup/
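
(The gist, from memory and not necessarily what the post recommends,
is that ZFS exposes compression and dedup as per-dataset properties;
the dataset name is a placeholder:

    # enable transparent compression and deduplication on a dataset
    zfs set compression=zstd tank/backups   # zstd needs OpenZFS 2.0+; use lz4 on older versions
    zfs set dedup=on tank/backups
)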
