On Fri, Dec 6, 2024 at 12:27 PM Nikola Dipanov <ndipa...@hudson-trading.com> wrote: > > Hi all, > > A quick summary first: we’ve found that for certain use cases - increasing SVN_DELTA_WINDOW_SIZE yields very significant storage savings on the size of the repo (~10x). We have a POC patch to make it configurable (currently only via fsfs.conf and using libsvn_ra_svn and libsvn_ra_local) and would be interested in working with the community to see if the changes could be improved, with the ultimate goal to have them accepted into the mainline. > > I have found some previous discussions on this from a good while back: https://svn.haxx.se/users/archive-2008-02/0547.shtml. but not much else. If there’s any additional information on this problem that I may have missed - please feel free to educate me! > > Otherwise if there are no glaring issues that anyone can see in supporting this - I’d be happy to look into cleaning up my POC patch and post it for detailed review. > > Obviously, there are trade-offs in doing this: A repo would most likely need to use a specified window size throughout its lifetime, and would not be beneficial for every use-case. This is why we propose to keep it as a config option. Some more details of our use-case and some rough numbers that illustrate the benefits follow. > > Our use case is that we commonly have very large files that see small changes over time, and a large xdelta window size would benefit such use-cases greatly. For us - the growth pace of our repositories is quite staggering due to this amplification, and we’d be more than happy to trade memory usage (especially, but also processing time to an extent) to be able to keep this in check. We’ve not done extensive measurements on how this impacts runtime yet. > > To demonstrate the effect - we generate a random and fairly large (~1.3 G, ~30 million lines) XML file, commit it, then make random changes to it (from 300 lines to ~1% of lines), committed those, and then looked at the file size generated by the commit in repo/db/revs/. We then repeat this, but with configuring a 100x larger window size (10240000). The most dramatic results are for small changes (this is somewhat intuitive) where the size of the revision with changes is ~40x smaller (10k vs 430k for a 1.3Gig file) when using a larger window size. For different patterns of changes the difference is not that dramatic but still large (5-10x). I am happy to share more details if people are interested. > > We believe we’d see a lot of benefit from this option, as would others in the community, and are very much committed to Subversion in the long run, so would love to hear what people think about something like this.
That sounds quite interesting. However, I'm a bit worried about compatibility. The thread you metioned has one reply [1] which points to another thread about a possible backward compatibility issue [2]. In that thread, Daniel Berlin wrote: > I also think we should up our default window size, but due to some > silliness in how this is currently implemented, changing the window size > is backwards incompatible! > > (This is because the code currently relies on knowing whether there are > more windows waiting to consume by comparing the window length against > the default window size, instead of by seeing if there is an EOF :( ) This is confirmed by Branko Čibej in another reply [3]. It's not clear to me whether those limitations still hold, but assuming they do there are some possible issues: - fsfs.conf settings can be changed at any time during the lifetime of the repository. So the first 1000 revisions may have a different window size than later ones. The code reading those revisions cannot handle that (if I interpret the above threads correctly). - A vanilla SVN server (or a client accessing the repository with file://) should at least be able to read the repository, even it was constructed with a different window size. Here too I fear this won't work (even if the window size would be a "immutable setting at creation time"). It all hinges on whether or not the current FSFS code can read data compressed with different window sizes (different from its own setting). Maybe things have changed since 2008 (there were several new versions of the fsfs format since then), maybe they haven't. Perhaps you can perform some experiments to test for the above issues, as a form of further exploration. [1] https://svn.haxx.se/users/archive-2008-02/0553.shtml [2] http://svn.haxx.se/dev/archive-2006-07/0911.shtml [3] https://svn.haxx.se/dev/archive-2006-07/0913.shtml -- Johan