On Wed, 4 Feb 2026 at 08:45, Marko Mäkelä <[email protected]> wrote:
>
> On Tue, Feb 3, 2026 at 7:47 PM Gordan Bobic via discuss
> <[email protected]> wrote:
> > But I guess if you could quickly block clone everything and
> > mariabackup is aware of it, then that would minimize the backup window
> > during which the redo log is at risk of overflowing.
>
> The current circular InnoDB WAL (ib_logfile0) would make the
> block-clone a little tricky. If we could block all writes to the file
> for a short time, then I think it could work.

I was actually thinking about it in terms of simply using block
cloning as a faster way to copy, because it doesn't have to actually
copy the blocks. I was sort of assuming that this block cloning would
be copy-on-write for future writes (a bit like FL-COW:
http://www.xmailserver.org/flcow.html ).
If you can make the copy orders of magnitude faster by skipping the
data copy and only COW-ing blocks on future writes to the original,
then WAL roll-over is that much less likely to happen.
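
As a rough illustration of what I had in mind (the paths are made up,
and this assumes a filesystem with reflink support - btrfs, XFS, or
OpenZFS 2.2+ with block cloning enabled - where cp should be able to
use a clone ioctl or copy_file_range instead of copying the data):

# near-instant "copy": only metadata is written, the data blocks are
# shared and COW-ed on future writes to either file
cp --reflink=always /var/lib/mysql/db/big_table.ibd /backup/big_table.ibd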

> > It has been a long time since I looked at btrfs, but I seem to vaguely
> > recall that its incrementals still involve reading the entire old and
> > new files to compute the delta, which is very inefficient,
> > particularly with databases where updating a single row means having
> > to re-read the entire tablespace.
> > ZFS is significantly more advanced than that and only has to read and
> > send the blocks that have actually changed.
>
> Thank you, this is very useful. Your description of incremental btrfs
> transfer resembles the way how mariadb-backup --backup --incremental
> currently works: it really reads all *.ibd files to find out which
> pages have been changed.

Yes - and doing that makes incrementals take as long as a full backup
- a significant problem when you have a 100TB database and the main
bottleneck is the time it takes to read the data files, not the amount
of space the backup takes up on the backup side.
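
For comparison, a ZFS incremental only ever reads and sends the blocks
that changed between two snapshots, roughly like this (the pool,
dataset and snapshot names are just placeholders):

zfs snapshot tank/mysql@backup-2026-02-04
zfs send -i tank/mysql@backup-2026-02-03 tank/mysql@backup-2026-02-04 \
    | ssh backuphost zfs receive backup/mysql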

> With the innodb_log_archive format, you would basically only copy the
> log that was written after the previous (full or incremental) backup
> finished, and it would cover all changes to InnoDB files. This would
> be analogous to the incremental ZFS snapshot transfer. However, the
> binlog, the .frm files and the files of any other storage engines
> would still have to be handled separately, until and unless an option
> is implemented to persist everything via a single log.

Yes, that would be roughly analogous to ZFS incrementals, with the
downside that the --prepare stage could potentially take several times
longer.

Just out of interest - why not handle the extra WALs purely in
mariabackup? When it detects that the redo log has come full circle,
simply write the additional log on the backup side, and replay those
extra WALs during --prepare?

> > > I have also been thinking of implementing a live streaming backup in
> > > the tar format. Perhaps, for performance reasons, there should be an
> > > option to create multiple streams in parallel. I am yet to experiment
> > > with this.
> >
> > I don't think tar can do that, which is why there is no such thing as
> > a parallel tar.
>
> Above, I was thinking of an option to split the content into multiple
> tar streams, which could be processed in parallel.

Right, but then you end up with multiple tars that have to be meshed
back together, which could be difficult - and xbstream/mbstream
already does all of this well.
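
For reference, the sort of single-stream pipeline I mean (flags and
paths are from memory, so treat it as a sketch - authentication and
--target-dir options omitted for brevity):

mariadb-backup --backup --parallel=8 --stream=xbstream \
    | ssh [email protected] "mbstream -x -C /backups/full"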

> > And tar can actually be a serious single-threaded bottleneck when you
> > are using NVMe drives and 10G+ networking.
>
> Can you think of anything that would allow efficient streaming using a
> single TCP/IP connection in this kind of an environment?

xbstream/mbstream already seems to do that. Ultimately everything is
going to be bottlenecked on a single thread; zfs send also sends in a
single thread, it is just lightweight enough not to become the
bottleneck before NVMe disks or 10G+ Ethernet do.
One additional advantage of zfs send is that if your data is
compressed by ZFS, sending with -c can avoid decompressing and
recompressing, which means you save disk I/O (data is read compressed
from disk), CPU on the source (no need to decompress and recompress
for the network transfer), network bandwidth (the data is already
compressed), and CPU and disk I/O again on the target side. The only
downside is that you have to rebuild the server if it isn't already on
ZFS. That is less of a problem on Ubuntu, which ships with it, and
slightly more of a problem on EL, which suffers from
"not-invented-here syndrome", but in either case it is far from
insurmountable.
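
To make the -c point concrete, the incremental pipeline from earlier
just gains one flag, and the blocks travel in their on-disk compressed
form end to end (names are still placeholders):

zfs send -c -i tank/mysql@backup-2026-02-03 tank/mysql@backup-2026-02-04 \
    | ssh backuphost zfs receive backup/mysql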

> > And the only real tunable only shifts it by about 33% (on x86-64 -
> > other platforms may be different):
> > https://shatteredsilicon.net/tuning-tar/
> > And 33% doesn't really move the needle enough for large fast servers
> > that run databases tens of terabytes in size.
>
> As demonstrated in https://jira.mariadb.org/browse/MDEV-38362, some
> more performance could be squeezed by using the Linux system calls
> sendfile(2) or splice(2). Unfortunately, both system calls are limited
> to copying 65536 bytes at a time. Such offloading is possible with the
> tar format, because there is no CRC on the data payload, only on the
> metadata.
>
> I fear that we may need multiple streams, which would complicate the
> interface. The simplest that I can come up with would be to specify
> the number of streams as well as the name of a script:
>
> BACKUP SERVER WITH 8 CLIENT '/path/to/my_script';
>
> The above would reuse existing reserved words. The specified script
> may make use of a unique parameter (stream number), something like the
> following:
>
> #!/bin/sh
> zstd|ssh [email protected] "cat>$1.tar.zstd"
>
> This kind of a format would allow full flexibility for any further
> processing. For example, you could extract multiple streams in
> parallel if you have fast storage:
>
> for i in *.tar.zstd; do tar xf "$i" --zstd -C /data & done

Doesn't the xbstream/mbstream format already implement enough stream
interleaving for parallel processing to be "good enough"?

> Streaming backup is something that I plan to work on after
> implementing a BACKUP SERVER command that targets a mounted file
> system. For that, I plan to primarily leverage the Linux
> copy_file_range(2), which can copy (or block-clone) up to 2 gigabytes
> per call, falling back to sendfile(2) and ultimately pread(2) and
> write(2).

Cool, that will make non-ZFS deployments less painful.


-- 
Gordan Bobic
Database Specialist, Shattered Silicon Ltd.
https://shatteredsilicon.net
Follow us:
LinkedIn: https://www.linkedin.com/company/shatteredsilicon
X: https://x.com/ssiliconbg