[ https://issues.apache.org/jira/browse/SVN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256056#comment-17256056 ]
Karl Fogel edited comment on SVN-525 at 12/30/20, 1:49 AM:
-----------------------------------------------------------

Hi, Aditi Maurya. Thanks for your interest in this issue. It's been a long time since I was a core developer, but I think I still understand enough of the internals of Subversion to point you in the right direction.

Let's start with some background.

Right now, Subversion stores a pristine copy of the BASE revision (that is, the currently checked out revision) of each file locally. These pristine copies are stored under the .svn/pristine/ directory in the top level of each checked-out working copy. Inside .svn/pristine/, you'll see a bunch of subdirectories with two-character names, and then inside each subdirectory there are some ".svn-base" files, where each file's name is an SHA1 hash. What's going on is exactly the kind of content-addressed arrangement you suspect :-).

The purpose of these pristine BASE copies is threefold:

# To make commits use less network bandwidth, because the commit only needs to send to the repository the differences between the local BASE version and the locally modified working file. (Remember that an SVN "commit" is like a "commit + push" would be in Git.)
# To make 'svn diff' and 'svn revert' be purely local operations, that don't talk to the upstream repository over the network.
# To enable the occasional three-way merge. This kind of merge is less common in Subversion than in distributed version control systems (DVCS), but it still is done sometimes.

Obviously, (1) is just an optimization. One *could* always just send the full new file contents in a commit. Furthermore, the optimization only happens for text files anyway, like program source code or plaintext documentation.
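
To illustrate point (1) with a rough sketch: the Python below builds a delta against a BASE text and, alternatively, a delta that carries the whole new file as literal data and so needs no BASE at all. The op format (`copy`/`insert` tuples) and all function names are made up for illustration; this is not Subversion's actual delta encoding.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, new: bytes):
    """Build a BASE-relative delta as a list of hypothetical ops:
    ("copy", offset, length) reuses bytes from base,
    ("insert", data) carries literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))
        # "delete": nothing to emit, those base bytes are just not copied
    return ops


def make_insert_only_delta(new: bytes):
    """Build a delta that never refers to BASE: one literal insert."""
    return [("insert", new)]


def apply_delta(base: bytes, ops):
    """Reconstruct the new contents from base plus the ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops):
    """Count how many literal bytes the delta has to carry."""
    return sum(len(op[1]) for op in ops if op[0] == "insert")


if __name__ == "__main__":
    base = b"line one\nline two\nline three\n"
    new = b"line one\nline 2\nline three\n"
    with_base = make_delta(base, new)
    without_base = make_insert_only_delta(new)
    print("literal bytes with BASE:   ", literal_bytes(with_base))
    print("literal bytes without BASE:", literal_bytes(without_base))
```

For a small edit to a text file, the BASE-relative delta carries only the changed bytes as literal data, while the BASE-free delta carries the whole file. Both reconstruct the same result.
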
Binary-format files (such as LibreOffice files, different versions of a video, PDFs, compilation output, etc) are not diffable/mergeable, at least not in the practical sense needed by a version control system, so committing them always ends up transmitting the entire new contents anyway.

Regarding (2): for binary (non-mergeable, non-diffable) files, no one is doing 'svn diff' against the BASE revision anyway. And while it's nice if 'svn revert' can be purely local, that's not a "must have" behavior. It's okay if it's local when the pristine BASE copy is present but uses the network when the BASE copy is not present.

And regarding (3): just as with (2), one could get the BASE copy from the upstream repository if necessary, but, again, no one is doing three-way merges on binary blobs anyway.

(By the way, note that Subversion doesn't store full history on the client side, the way DVCSs like Git and Mercurial do. That's why in Subversion we call the local side a "working copy" or "working tree", not a "repository". For textual/mergeable materials, the DVCS way is superior -- having full history locally is great, and you can afford it when the history can be stored in an efficient internal-diff way locally. However, when you're keeping successive versions of big binary blobs under version control, the DVCS way doesn't work well: it requires too much storage on every client machine. For this situation, SVN's way is better, because only the central repository server needs that kind of storage.)

Okay, the above is all background. Now here's what this issue is about:

For those who do use Subversion to version binary blobs, it's already workable, but the problem is it is still using *double* the amount of client-side disk space it needs to. When it comes to trees with really large objects, this is a problem!
Those pristine BASE versions are not helping anyone: they don't make commits more efficient in this use case, and since these files are almost never mergeable, no one is doing 'svn diff' or three-way merges on them either. At the most, someone might want to do an 'svn revert', but it's okay if that's not a purely local operation.

If the BASE version of a file weren't present, most operations would work just fine. Even Subversion's network protocol for transmitting changes wouldn't need to be updated, because that protocol already has an "insert the following N bytes" command. Therefore one can *always* construct a commit-transmission diff as a series of inserts, without reference to any BASE contents. (Now, of course, if both the client and server were updated to support some kind of 'sendfile' functionality for that circumstance, that might be even more efficient, but that's an optimization. New clients will still be able to work with old servers, guaranteed.)

So the modification needed here is purely client-side. To make this change, all one has to do is find the parts of the working copy code that currently consult the pristine BASE version and make them still work when the pristine BASE is not present.

While Subversion is already a decent system for keeping track of large binary blobs, with this change, it would be a really *good* system for doing that, especially because of its optional file-locking feature. (None of the DVCSs are really suitable for this use case, by design, as far as I'm aware.)

I think the mechanics of the change would involve code under {{subversion/libsvn_wc/}} and maybe {{subversion/libsvn_client/}} in the [Subversion sources|http://subversion.apache.org/source-code.html].

The decision about when to omit a pristine BASE copy should be made purely by the client side, as different people may configure it differently depending on their local storage capacity.
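
To make the idea of a purely client-side decision a bit more concrete, here is a minimal Python sketch of a per-file predicate a client could evaluate. Everything in it is hypothetical: the class, its fields, and the criteria (a size threshold, name patterns, MIME types) are illustrative assumptions, and nothing like this exists in Subversion today.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class NoPristinePolicy:
    """Hypothetical client-side rules for skipping the pristine copy.
    None of these names correspond to a real Subversion setting."""
    min_size_bytes: Optional[int] = None          # skip at or above this size
    name_patterns: List[str] = field(default_factory=list)   # e.g. "*.iso"
    mime_types: List[str] = field(default_factory=list)      # e.g. "video/*"

    def skip_pristine(self, name: str, size: int,
                      mime_type: Optional[str]) -> bool:
        """Return True if any configured rule matches this file."""
        if self.min_size_bytes is not None and size >= self.min_size_bytes:
            return True
        if any(fnmatchcase(name, pat) for pat in self.name_patterns):
            return True
        if mime_type is not None and any(
                fnmatchcase(mime_type, pat) for pat in self.mime_types):
            return True
        return False


if __name__ == "__main__":
    policy = NoPristinePolicy(
        min_size_bytes=100 * 1024 * 1024,
        name_patterns=["*.iso"],
        mime_types=["video/*"],
    )
    for name, size, mime in [
        ("main.c", 2048, "text/plain"),
        ("clip.bin", 4096, "video/mp4"),
        ("disk.iso", 4096, None),
    ]:
        print(name, policy.skip_pristine(name, size, mime))
```

Treating the rules as "any match wins" keeps the sketch simple; a real design would also have to decide how such rules interact with existing working copies.
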
This would mean some kind of new specifier in the [client-side run-time configuration area|http://svnbook.red-bean.com/nightly/en/svn.advanced.confarea.html], like maybe a {{no-pristine}} option that can be set based on any of file size, file name pattern, or file mime-type. (There may be some discussion of user-facing design questions earlier in this issue, too.)

I'm very happy to answer more questions here, and I'd also suggest that you post to the Subversion Development mailing list (see the [mailing lists|http://subversion.apache.org/mailing-lists.html] page) with questions/ideas. If you post there, please CC me (kfogel {_AT_} red-bean.com); I'm not subscribed to the list these days, but I'd like to follow any progress on this issue and help where I can. There are many much more experienced developers on the list too, and they'll be able to save you a lot of time.

> Allow working copies without .svn/pristine/ cache (a.k.a. "text-base/" files).
> ------------------------------------------------------------------------------
>
> Key: SVN-525
> URL: https://issues.apache.org/jira/browse/SVN-525
> Project: Subversion
> Issue Type: New Feature
> Affects Versions: all
> Environment: other
> Reporter: Ben Collins-Sussman
> Priority: Minor
> Fix For: unscheduled
>
>
> It's possible to make the cached pristine files in .svn/pristine/ optional.
> Doing so would be a huge storage savings on the client side, and would make
> Subversion even more compelling as a system for managing medium-large binary
> files.
> A much more technically thorough explanation of this issue and its background
> is available in [this 2020-12-29 comment|https://issues.apache.org/jira/browse/SVN-525?focusedCommentId=17256056&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256056]
> below.
> (Note that the cached pristine base versions used to be stored in
> .svn/text-base/, so you'll probably see references to that old location
> throughout this ticket. Also, there used to be one .svn/ directory per
> working tree directory; later that was changed to one .svn/ directory at the
> top of the working tree. Knowing that might also help clarify some of the
> older comments in this ticket.)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)