[ 
https://issues.apache.org/jira/browse/SVN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256056#comment-17256056
 ] 

Karl Fogel edited comment on SVN-525 at 12/30/20, 1:49 AM:
-----------------------------------------------------------

Hi, Aditi Maurya.  Thanks for your interest in this issue.  It's been a long 
time since I was a core developer, but I think I still understand enough of the 
internals of Subversion to point you in the right direction.

Let's start with some background.  Right now, Subversion stores a pristine copy 
of the BASE revision (that is, the currently checked out revision) of each file 
locally.  These pristine copies are stored under the .svn/pristine/ directory 
in the top level of each checked-out working copy.  Inside .svn/pristine/, 
you'll see a bunch of subdirectories with two-character names, and then inside 
each subdirectory there are some ".svn-base" files, where each file's name is 
an SHA1 hash.  What's going on is exactly the kind of content-addressed 
arrangement you suspect :-).
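
To make that layout concrete, here is a minimal Python sketch of how a 
pristine file's location could be derived from its contents.  The function 
name {{pristine_path}} is made up for illustration, and the sketch assumes the 
two-character subdirectory is simply the first two hex digits of the SHA1 
hash; check the code under {{subversion/libsvn_wc/}} before relying on it.

```python
import hashlib
from pathlib import Path


def pristine_path(wc_root: Path, contents: bytes) -> Path:
    """Return where a pristine copy of `contents` would be expected to live.

    Hypothetical helper, not part of Subversion's API.
    """
    digest = hashlib.sha1(contents).hexdigest()
    # Assumption: the two-character subdirectory name is the first two
    # hex digits of the SHA1 hash of the file's contents.
    return wc_root / ".svn" / "pristine" / digest[:2] / f"{digest}.svn-base"


if __name__ == "__main__":
    # Two files with identical contents map to the same pristine path,
    # which is what "content-addressed" means here.
    print(pristine_path(Path("my-working-copy"), b"same bytes"))
    print(pristine_path(Path("my-working-copy"), b"same bytes"))
```

Because the path depends only on the contents, identical files anywhere in 
the working copy would share one pristine copy under this scheme.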

The purpose of these pristine BASE copies is threefold:

# To make commits use less network bandwidth, because the commit only needs to 
send to the repository the differences between the local BASE version and the 
locally modified working file.  (Remember that an SVN "commit" is like a 
"commit + push" would be in Git.)

# To make 'svn diff' and 'svn revert' purely local operations that don't 
talk to the upstream repository over the network.

# To enable the occasional three-way merge.  This kind of merge is less common 
in Subversion than in distributed version control systems (DVCS), but it is 
still done sometimes.

Obviously, (1) is just an optimization.  One *could* always just send the full 
new file contents in a commit.  Furthermore, the optimization only happens for 
text files anyway, like program source code or plaintext documentation.  
Binary-format files (such as LibreOffice files, different versions of a video, 
PDFs, compilation output, etc.) are not diffable/mergeable, at least not in the 
practical sense needed by a version control system, so committing them always 
ends up transmitting the entire new contents anyway.

Regarding (2): for binary (non-mergeable, non-diffable) files, no one is doing 
'svn diff' against the BASE revision anyway.  And while it's nice if 'svn 
revert' can be purely local, that's not a "must have" behavior.  It's okay if 
it's local when the pristine BASE copy is present but uses the network when the 
BASE copy is not present.

And regarding (3): just as with (2), one could get the BASE copy from the 
upstream repository if necessary, but, again, no one is doing three-way merges 
on binary blobs anyway.

(By the way, note that Subversion doesn't store full history on the client 
side, the way DVCSs like Git and Mercurial do.  That's why in Subversion we 
call the local side a "working copy" or "working tree", not a "repository".  
For textual/mergeable materials, the DVCS way is superior -- having full 
history locally is great, and you can afford it when the history can be stored 
in an efficient internal-diff way locally. However, when you're keeping 
successive versions of big binary blobs under version control, the DVCS way 
doesn't work well: it requires too much storage on every client machine.  For 
this situation, SVN's way is better, because only the central repository server 
needs that kind of storage.)

Okay, the above is all background.  Now here's what this issue is about:

For those who do use Subversion to version binary blobs, it's already workable, 
but the problem is it is still using *double* the amount of client-side disk 
space it needs to.  When it comes to trees with really large objects, this is a 
problem!  Those pristine BASE versions are not helping anyone: they don't make 
commits more efficient in this use case, and since these files are almost never 
mergeable, no one is doing 'svn diff' or three-way merges on them either.  At 
the most, someone might want to do an 'svn revert', but it's okay if that's not 
a purely local operation.

If the BASE version of a file weren't present, most operations would work just 
fine.  Even Subversion's network protocol for transmitting changes wouldn't 
need to be updated, because that protocol already has an "insert the 
following N bytes" command.  Therefore one can *always* construct a 
commit-transmission diff as a series of inserts, without reference to any BASE 
contents.
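
As a toy illustration of that idea, the Python sketch below encodes new file 
contents as nothing but "insert" operations and then rebuilds the file from 
them with no BASE at all.  The op format, the function names, and the chunk 
size are all invented for this example; this is not Subversion's actual wire 
format, only the shape of the argument.

```python
from typing import List, Tuple

# Toy operation: ("insert", payload).  A real delta format would also have
# operations that copy from the BASE; this sketch never needs them.
Op = Tuple[str, bytes]


def insert_only_delta(new_contents: bytes, chunk_size: int = 4) -> List[Op]:
    """Describe `new_contents` purely as a series of insert operations."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        ("insert", new_contents[i:i + chunk_size])
        for i in range(0, len(new_contents), chunk_size)
    ]


def apply_delta(ops: List[Op]) -> bytes:
    """Rebuild the file from insert-only ops; no BASE contents required."""
    out = bytearray()
    for kind, payload in ops:
        if kind != "insert":
            raise ValueError(f"unsupported op in this sketch: {kind!r}")
        out += payload
    return bytes(out)


if __name__ == "__main__":
    data = b"new binary blob contents"
    assert apply_delta(insert_only_delta(data)) == data
```

The receiver never looks at any earlier version of the file, which is why a 
missing pristine BASE on the client does not prevent a commit.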

(Now, of course, if both the client and server were updated to support some 
kind of 'sendfile' functionality for that circumstance, that might be even more 
efficient, but that's an optimization.  New clients will still be able to work 
with old servers, guaranteed.)

So the modification needed here is purely client-side.  To make this change, 
all one has to do is find the parts of the working copy code that currently 
consult the pristine BASE version and make them still work when the pristine 
BASE is not present.
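
In rough terms, each such code path would need a fallback along the lines of 
the Python sketch below.  Both {{read_base_contents}} and the 
{{fetch_from_repository}} callback are hypothetical names for illustration; 
the real change would be in C inside the working copy library, and its shape 
might be quite different.

```python
from pathlib import Path
from typing import Callable


def read_base_contents(
    pristine_file: Path,
    fetch_from_repository: Callable[[], bytes],
) -> bytes:
    """Return the BASE contents, preferring the local pristine copy.

    Hypothetical helper: `fetch_from_repository` stands in for whatever
    network call would retrieve the BASE revision from the server.
    """
    if pristine_file.is_file():
        # Pristine copy is present: stay purely local.
        return pristine_file.read_bytes()
    # Pristine copy was omitted: fall back to the network.
    return fetch_from_repository()
```

An operation like 'svn revert' would then behave the same from the user's 
point of view, only slower when the pristine copy is absent.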

While Subversion is already a decent system for keeping track of large binary 
blobs, with this change, it would be a really *good* system for doing that, 
especially because of its optional file-locking feature.  (None of the DVCSs 
are really suitable for this use case, by design, as far as I'm aware.)

I think the mechanics of the change would involve code under 
{{subversion/libsvn_wc/}} and maybe {{subversion/libsvn_client/}} in the 
[Subversion sources|http://subversion.apache.org/source-code.html].  The 
decision about when to omit a pristine BASE copy should be made purely by the 
client side, as different people may configure it differently depending on 
their local storage capacity.  This would mean some kind of new specifier in 
the [client-side run-time configuration 
area|http://svnbook.red-bean.com/nightly/en/svn.advanced.confarea.html], like 
maybe a {{no-pristine}} option that can be set based on any of file size, file 
name pattern, or file mime-type.  (There may be some discussion of user-facing 
design questions earlier in this issue, too.)
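
To sketch what such a client-side decision might look like, here is a small 
Python example.  Everything in it is an assumption made for illustration: the 
{{NoPristineRule}} fields, the "at least this many bytes" meaning of the size 
threshold, and the glob-style matching of names and mime-types.  No such 
option exists in Subversion's configuration today.

```python
import fnmatch
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NoPristineRule:
    """Hypothetical client-side settings for omitting pristine copies."""
    min_size_bytes: Optional[int] = None                     # e.g. 100 MiB
    name_patterns: List[str] = field(default_factory=list)   # e.g. "*.mp4"
    mime_types: List[str] = field(default_factory=list)      # e.g. "video/*"


def should_skip_pristine(
    rule: NoPristineRule,
    file_name: str,
    size_bytes: int,
    mime_type: Optional[str] = None,
) -> bool:
    """Return True if any configured criterion matches the file."""
    if rule.min_size_bytes is not None and size_bytes >= rule.min_size_bytes:
        return True
    if any(fnmatch.fnmatchcase(file_name, p) for p in rule.name_patterns):
        return True
    if mime_type is not None and any(
        fnmatch.fnmatchcase(mime_type, p) for p in rule.mime_types
    ):
        return True
    return False


if __name__ == "__main__":
    rule = NoPristineRule(min_size_bytes=100 * 1024 * 1024,
                          name_patterns=["*.mp4"],
                          mime_types=["video/*"])
    print(should_skip_pristine(rule, "intro.mp4", 5_000))
    print(should_skip_pristine(rule, "README.txt", 2_000, "text/plain"))
```

An empty rule matches nothing, so in this sketch the default is to keep 
pristine copies exactly as Subversion does now.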

I'm very happy to answer more questions here, and I'd also suggest that you 
post to the Subversion Development mailing list (see the [mailing 
lists|http://subversion.apache.org/mailing-lists.html] page) with 
questions/ideas.  If you post there, please CC me (kfogel {_AT_} red-bean.com); 
I'm not subscribed to the list these days, but I'd like to follow any progress 
on this issue and help where I can.  There are many much more experienced 
developers on the list too, and they'll be able to save you a lot of time.



> Allow working copies without .svn/pristine/ cache (a.k.a. "text-base/" files).
> ------------------------------------------------------------------------------
>
>                 Key: SVN-525
>                 URL: https://issues.apache.org/jira/browse/SVN-525
>             Project: Subversion
>          Issue Type: New Feature
>    Affects Versions: all
>         Environment: other
>            Reporter: Ben Collins-Sussman
>            Priority: Minor
>             Fix For: unscheduled
>
>
> It's possible to make the cached pristine files in .svn/pristine/ optional.  
> Doing so would be a huge storage savings on the client side, and would make 
> Subversion even more compelling as a system for managing medium-large binary 
> files.
> A much more technically thorough explanation of this issue and its background 
> is available in [this 2020-12-29 
> comment|https://issues.apache.org/jira/browse/SVN-525?focusedCommentId=17256056&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256056]
>  below.
> (Note that the cached pristine base versions used to be stored in 
> .svn/text-base/, so you'll probably see references to that old location 
> throughout this ticket.  Also, there used to be one .svn/ directory per 
> working tree directory; later that was changed to one .svn/ directory at the 
> top of the working tree.  Knowing that might also help clarify some of the 
> older comments in this ticket.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)