Hi Evgeny, Brane,
Thank you for taking the time to go through the patch in depth; you
bring up a lot of good points about the feature.

> Also, the --store-pristine=no option already provides a consistent size
> reduction on all file systems and OSes, without disabling the streaminess,
> regressing other characteristics or requiring any kind of in-advance probing.

So "--store-pristine=no" provides the largest disk savings, but it
comes with a few downsides:
- The biggest downside is that it's opt-in, so the average user won't
know about it or use it
- You lose the ability to diff files without pulling data from the SVN server
- It requires a new working copy format which older tools won't support

> The optimization is far from universally applicable.  For example, it
> doesn't cover the default file system on Windows (NTFS).

Across the 3 OSes, Windows users are probably least likely to have a
reflink-supporting filesystem, since ReFS was only made available on
standard versions of Windows about 3 years ago. Microsoft is pushing
its usage, though, via the Dev Drive feature, which you can see here:
https://learn.microsoft.com/en-us/windows/dev-drive/

Linux is similar to Windows in that the majority of distributions
still default to ext4 (no reflink), but a higher % of users likely
have reflink-supporting filesystems because those filesystems have
been around longer and there is a greater presence of power users.
Some distributions, like Fedora, use Btrfs (reflink supported) as the
default filesystem.

macOS is the best case by far: the default filesystem is APFS, which
has reflink support. The main case where it wouldn't apply is if the
working copy is on an external drive formatted as FAT32.

> The required probing is too expensive to run in every scope where it
> would be needed.
>
> For example, probing on Windows calls GetVolumeInformation().  Probing
> on other OSes has to create a file to check whether copy-on-write is
> supported.
>
> The patch limits this probing only for larger scopes, such as attaching it
> to an update editor instance, but smaller-scoped operations like workqueue
> installs or file reverts are left without probing.  If not all code paths
> use the copy-on-write path, the size reduction may not be sustainable or
> predictable in the long run.

I only added the probing so that checkouts can still benefit from
"streamy checkouts" when reflink support is absent. The probing isn't
that slow, but it's also not really needed for the smaller operations
that still use the workqueue installs. The implementation differs
slightly by platform:

Linux has the best API support for this: ficlone_fd can be used with
an existing open file handle. If that fails due to lack of reflink
support, the fallback uses the same open handle to perform a regular
byte copy. This costs only 1 extra system call per file when there is
no reflink support.
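
For illustration, here is a minimal Python sketch of that
clone-then-fallback pattern (it is not the patch's C code). It assumes
ficlone_fd boils down to the Linux FICLONE ioctl; clone_or_copy and
chunk_size are hypothetical names, and the FICLONE value is the one
from <linux/fs.h> on common architectures such as x86 and ARM.

```python
import fcntl

# Assumption: FICLONE == _IOW(0x94, 9, int) as on x86/ARM Linux.
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, preferring a reflink.

    Returns True if the file was cloned, False if it was byte-copied.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # One system call: ask the filesystem to share the extents.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # No reflink support (or a cross-device copy): reuse the
            # same open handles for a regular byte copy.
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)
            return False
```

On a filesystem without reflink support the only extra cost in this
sketch is the single failed ioctl before the ordinary copy loop runs.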

macOS is similar to Linux, but the API requires file paths instead of
handles, making the fallback path less smooth. That said, the fallback
is unlikely to be needed often given macOS's good reflink support.

On Windows, CopyFile2 is a generic file copy API, so even on NTFS this
performs a regular file copy; it is not a failed call like on Linux.
There is a fallback path if it fails, but I'm not sure if there are
any cases where CopyFile2 fails where a byte copy loop would succeed.
The reason to avoid using CopyFile2 via probing is only to get the
"streamy checkout" benefits which don't apply to all SVN operations
anyway.
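
The one-off probe discussed above could look roughly like the
following Python sketch. It assumes a Linux-style probe that creates
two temporary files and attempts a FICLONE between them;
probe_reflink and OperationContext are hypothetical names, not
functions from the patch, which attaches the result to larger scopes
such as the update editor instance.

```python
import fcntl
import tempfile

# Assumption: FICLONE == _IOW(0x94, 9, int) as on x86/ARM Linux.
FICLONE = 0x40049409


def probe_reflink(directory):
    """Return True if a clone between two files in directory succeeds."""
    with tempfile.NamedTemporaryFile(dir=directory) as src, \
            tempfile.NamedTemporaryFile(dir=directory) as dst:
        src.write(b"probe")
        src.flush()
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            return False


class OperationContext:
    """Hypothetical per-operation state: probe once, reuse the answer."""

    def __init__(self, wc_dir):
        self.wc_dir = wc_dir
        self._reflink = None  # unknown until the first query

    def use_reflink(self):
        if self._reflink is None:
            self._reflink = probe_reflink(self.wc_dir)
        return self._reflink
```

Every file installed during the same operation would then consult
use_reflink() and pick either the clone path or the streamy path
without probing again.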

> Skipping the streamy checkout path can potentially regress performance and
> cause spurious HTTP timeouts and failures. To some extent, this is a step
> back from what we currently have on trunk.

From what I've gathered, the HTTP timeouts were caused when large files
were being written to disk. Setting up the reflink is only a small
metadata operation, so it will always be very fast even on large
files. Since the file data isn't being written to disk, it should
avoid the HTTP timeout issue.

> The new copy_file_to_temp_copyfile_windows() appears to make significantly
> more syscalls than the current code on trunk.  It opens and closes a temp
> file, performs the copy, and changes the file attributes and mtime, all
> with separate path-based calls, which are very expensive on Windows.

So in my testing a checkout onto a ReFS drive on Windows is about 10%
slower than a streamy checkout on an NTFS drive, likely due to these
extra system calls. I wasn't 100% sure about SVN's file attribute
requirements, so I was overly strict on matching the existing path.
Perhaps some of those steps could be trimmed back to aid performance.
For me a 10% slowdown is well worth the disk space saving, but
obviously this is a personal opinion.

> Thoughts on creating a branch for this? It may be easier to work through
> actual working code, as well as test interaction with other working copy
> changes.
>
> Downside is that the code may just bitrot on a branch that's not
> actively maintained.

I don't think it needs its own branch. It would be easy to add a
global toggle via a pre-processor define, since the feature is
already gated behind the reflink support probing.

Thanks,
Jordan



On Wed, 10 Jun 2026 at 15:41, Jordan Peck <[email protected]> wrote:
>
> Hi all,
>
> I've spent time cutting the patch down to the minimal changes needed to get 
> the feature working while maintaining a sensible integration (not hacking it 
> in). The majority of the patch size now comes from the new io_copy_temp.c 
> file which holds all the native platform functions for probing/setting up 
> file reflinks.
>
> I've compiled and run the tests on Windows, Linux and macOS this time.
>
> Hopefully this is a reviewable size now.
> Let me know what you think and any questions.
>
> Thanks,
> Jordan
>
>
> On Fri, 5 Jun 2026 at 21:02, Jordan Peck <[email protected]> wrote:
>>
>> Regarding the disk space savings, I haven't tested against pristineless, but 
>> the savings are also close to 50% on a repo with lots of large binary files:
>>>
>>> The large SVN repo we use at work is a 1.6TB fresh checkout on disk, with 
>>> this change that drops to 865GB (on Windows+ReFS). A huge saving due to the 
>>> large number of art assets in our repository!
>>> I ran some tests with and without the patch, the ~10% slowdown is from 
>>> losing the streamy checkout benefits and instead having to do extra system 
>>> calls.
>>>
>>> 1.16:      1588.22 GB  169 mins
>>> 1.16-CoW:  864.76 GB   202 mins
>>
>>
>>>  volatile svn_atomic_t svn_wc__test_writer_copy_source_count = 0;
>>
>> I agree this extern debug atomic for a test is awkward; I didn't really see 
>> any way around it if we want a test that checks the reflink path is actually 
>> being taken. Adding platform native methods for checking if a file is 
>> reflinked for testing reasons is not practical, given it's also not simple 
>> to work out via the filesystem anyway. However, we could discuss reworking 
>> or removing the test to avoid this weird variable. I'm not sure about the 
>> warning you are seeing; I don't think I saw that when I built for 
>> Windows/Linux.
>>
>>>
>>> This means that, for example, a Subversion working copy, where most
>>> files have svn:eol-style=native, would see no improvement, correct?
>>
>> No, they would see no benefit, which is why this saves slightly less space 
>> than pristineless. But in our large repo 95% of the disk usage comes from 
>> large binary blob files which can benefit. I realise this isn't the case for 
>> everyone, but if you have a repo with only code files with 
>> svn:eol-style=native it's probably not large enough to be a disk space 
>> concern in the first place.
>>
>>> How does this interact with the pristines-on-demand feature? How is this
>>> tested?
>>
>> There isn't a test for this but the "pristines-on-demand" takes precedence 
>> over reflinking, they shouldn't interfere with each other.
>>
>>> If clients that do not support CoW use the same working copy as a client
>>> that does, how do they interact?
>>
>> Yes! That's the best part, there's no working copy format change. The 
>> filesystem handles the reflinked files transparently once they've been set 
>> up. If you did a checkout on a CoW supporting drive (ReFS) and got the disk 
>> usage benefits of CoW you would still be able to copy your svn checkout 
>> folder to another drive with no CoW support (NTFS), it would transparently 
>> expand to its full size on the new drive, but otherwise would be fully 
>> functional.
>>
>>> Does this happen on every invocation? Concretely: if I'm on Windows with
>>> NTFS, I'll get two file open attempts instead of one, every time?
>>
>> There are details on this in the op, but no: there is a one-time check at the 
>> start of a checkout/update operation to see if the current filesystem 
>> supports CoW. If not, all files will use the current "streamy-checkout" path, 
>> avoiding any extra system calls.
>>
>>> Not that this is indicative of patch quality in general. But I'd have
>>> preferred to see some discussion on dev@ before being presented with
>>> 160k of patch. It's basically unreviewable, I don't even know where to
>>> start.
>>
>> I was originally working on this feature for myself and work colleagues. It 
>> was initially a hacky, Windows-only implementation. Then I added Linux, 
>> polished it a bit and ended up fleshing it out a lot more. It got to the 
>> point where I figured I might as well bring it all together into a patch and 
>> share it here to get people's thoughts on it. I do realise it's a huge 
>> patch, and in the "Notes" section of the op I laid out how it could be split 
>> into several parts. But I didn't want to do the work to split it before 
>> gauging general interest in the feature.
>>
>> On Fri, 5 Jun 2026 at 19:50, Sean McBride <[email protected]> wrote:
>>>
>>> On 4 Jun 2026, at 7:31, Jordan Peck via dev wrote:
>>>
>>> > This patch saves disk space on supported filesystems for byte-identical 
>>> > file installation by utilising filesystem clone APIs.
>>>
>>> FWIW, just the other week, I've used this tool:
>>>
>>> https://github.com/ttkb-oss/dedup
>>>
>>> on my Mac, and several others in our office, to deduplicate files in our 
>>> svn (1.14) working copies. It detected identical files not only in our 
>>> /branches vs /trunk but also among the pristine copies, and it deduplicated 
>>> them using APFS' cloning feature.
>>>
>>> It didn't cause svn to freak out, so that bodes well.
>>>
>>> Sean
