On 2023-10-07 Sa 01:51, Thomas Munro wrote:
Hello hackers,

Here is an experimental POC of fast/cheap database cloning. For clones of little template databases, no one cares much, but it might be useful to be able to create a snapshot or fork of a very large database for testing/experimentation, like this:

  create database foodb_snapshot20231007 template=foodb strategy=file_clone

It should be a lot faster, and use less physical disk, than the two existing strategies on recent-ish XFS, BTRFS, very recent OpenZFS, and APFS (= macOS), and in theory it could be extended with more work to other systems that invented different system calls for this (Solaris, Windows). Extra physical disk space is then consumed only as the two clones diverge.

It's just like the old strategy=file_copy, except it asks the OS to do its best copying trick. If you try it on a system that doesn't support copy-on-write, copy_file_range() should fall back to a plain old copy, but even that might be better than what we could do ourselves, as it can push copy commands down to network storage or physical storage.

Therefore, the usual caveats from strategy=file_copy also apply here: it has to perform checkpoints, which could be very expensive, and there is some quirkiness/brokenness around concurrent backups and PITR. Which makes me wonder if it's worth pursuing this idea. Thoughts?

I tested on bleeding-edge FreeBSD/ZFS, where you need to set sysctl vfs.zfs.bclone_enabled=1 to enable the optimisation, as it's a very new feature that is still being rolled out. The system call succeeds either way, but that setting controls whether the new database initially shares blocks on disk or gets new copies. I also tested on a Mac. In both cases I could clone large databases in a fraction of a second.
I've had to disable COW on my BTRFS-resident buildfarm animals (see previous discussion re Direct I/O).
cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com