On 30 Jul 2021, Daniel Shahaf wrote:
> Karl Fogel wrote on Tue, Jul 27, 2021 at 20:24:32 -0500:
>> 1) Make pristine text-base files optional. See issue #525 for
>> details. In summary: currently, every large file uses twice the
>> storage on the client side, and yet for most of these files there's
>> little benefit. They're usually not plaintext, so 'svn diff' against
>> the pristine base is pointless (unless you have some specialized
>> diff tool for the particular binary format, but that's rare),
> Then how do people do pre- or post-commit reviews of their changes?
I think you're thinking of code, or small prose files, or something?
But I'm talking about 100GB zip files and other gigantic generated
binary blobs. When working with objects like that, the way one reviews
one's changes is usually not by diffing against the previous version --
or if one *is* going to do that, then one simply manually keeps a safe
copy of the original version around until the commit is done.

(I answered the "pre-commit" part of your question. I don't understand
the "post-commit" part -- even in the small-text-files case, the
pristine base files don't help with post-commit review anyway.)
>> and 'svn commit' likewise just sends up the whole working file. The
>> only thing a local base gets you is local 'svn revert', which can be
>> nice, but many of us would happily give it up for large files to
>> avoid the 2x local storage cost.
> What about the ability to commit a change by uploading the delta as
> opposed to the new fulltext?
As issue #525 notes, with these kinds of files there is almost never
any useful delta for that anyway. You end up shipping the entire new
version across the wire on every commit. In fact, the client wastes
time trying to find a useful diff, when in the end it's just going to
have to send the fulltext (or the size equivalent of the fulltext). It
would have been faster just to call sendfile() or something like that.

Really, for these kinds of files, the only thing local pristine base
files provide is local revert (at the cost of 2x storage).
>> Note that this is a purely client-side change, controlled entirely
>> by client-side configuration. Different people can thus have
>> different thresholds, depending on how much local disk space they
>> have. A server would never even know if a client is or isn't saving
>> text-bases.
> What would «svn status» of a modified file without a pristine say?
> How many network/worktree accesses would it involve?
Status would say "modified". The client side still knows the
fingerprint (hash) of the pristine original, naturally.
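
To illustrate the principle (this is a sketch of the idea, not of how
the working-copy library actually stores things): even with no pristine
on disk, the recorded checksum is enough to answer "modified or not?"
with one full read of the working file and no network access.

    # Hypothetical status check for a pristine-less file; the file
    # name is just a stand-in. 'svn info' already reports the
    # pristine's checksum for a working-copy file.
    recorded=$(svn info -- big-file.zip | sed -n 's/^Checksum: //p')
    actual=$(sha1sum big-file.zip | cut -d' ' -f1)
    [ "$actual" = "$recorded" ] || echo "M       big-file.zip"

So the cost is one linear scan of the working file, which is roughly
what status pays today anyway when timestamps don't settle the
question.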
> Would it be possible to convert a file back and forth between having
> and not having a pristine?
Sure, though it might involve the network, unless you haven't modified
the local file. I doubt that in practice this would be done very often
(but I could be wrong about that, who knows).
> Suppose the user reverts the file without using «svn revert». Would
> the file show up as modified? Would a commit cause a null change to
> the file (new noderev with fulltext and props both identical to the
> predecessor noderev's)?
See above about the fingerprint: the client can tell from the hash
that the manually reverted file matches the pristine, so it would not
show up as modified and no null change would be committed.
> How about (hard-|sym)linking the worktree file to the pristine and
> making it read-only until the user requests it to be made editable?
> Compare git-annex-unlock(1).
That's an interesting idea. I'm not sure how portable hard links are,
and we can't rely on every file-modifying tool on the user's system
knowing about this.

I don't know, hmm. It's clever, but it also feels shaky and unreliable
to me. And even then, when I run 'svn edit' or whatever to declare the
file editable and make it read-write, I usually *still* don't want 2x
the storage! :-)

And supposing I am okay with the 2x storage for that file, there's
still the time cost: for a large file, copying the working file from
the pristine could take a while.

One can get the same effect by just making a copy off to the side
manually. In a sense, you're proposing that this "keep a safe copy
manually" behavior be built in to Subversion in the form of the "make
editable" command. But I'm not sure the extra complexity in Subversion
is worth it when such an easy workaround is available.

But the idea is still attractive. I'm curious what others think.
> There was also a request to store pristines compressed, but I don't
> know whether there's still demand for that.
Very often these kinds of files are already compressed, and so can't
be usefully compressed further.

In fact, in the actual use case I deal with most often, they are
literally gigantic compressed and encrypted .zip files, with each one
containing a bunch of PDFs and/or CSV files.
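
(That's easy to confirm for any particular file with stock tools; the
file name here is just a stand-in:

    wc -c < big-file.zip               # original size
    gzip -9 -c big-file.zip | wc -c    # recompressed size: usually within
                                       # a fraction of a percent, sometimes
                                       # slightly *larger*

Deflate over already-deflated or encrypted data has essentially
nothing left to squeeze.)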
>> 2) Add a new '--depth=directories' depth type to make it easy to
>> check out a sparse tree, that is, a skeleton directory tree without
>> the files. Then, within a given directory, you can do 'svn update
>> --depth=files' or check out a particular file by name as needed.
>> There's no ticket associated with this feature, as far as I know,
>> but I can file one after this post if people think this idea is
>> worthwhile.
> Hmm.
>
> Taking the FreeBSD ports tree (https://svnweb.freebsd.org/ports/head/)
> as an example, the obvious next feature request would be to also
> fetch the pkg-descr files from each port directory, even as new ports
> are added, in order to facilitate a local search of port descriptions
> (via «make search» in ports(7)).
>
> Taking ASF's dist/release/ tree as an example, it might be useful to
> automatically retrieve only READMEs and detached signatures, but not
> the artifacts themselves.
Yes, those are plausible use cases. Behaviors like that are easily
scriptable with the feature I've described.
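
For instance, your FreeBSD case could look something like this (a
sketch only: '--depth=directories' is the proposed flag and doesn't
exist yet; the rest is stock svn, and it assumes no whitespace in
paths):

    # Check out the skeleton directory tree, then pull in just the
    # pkg-descr files by name.
    svn checkout --depth=directories -- "$URL" ports
    cd ports
    svn list -R -- "$URL" | grep '/pkg-descr$' | xargs svn update --

Re-running the listing step after an update should pick up the
pkg-descr files of newly added ports as well.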
> In general, I suspect «svn_boolean_t download_it_p(dirent *foo) {
> return foo->kind == svn_node_directory; }» is only half right: when
> FOO is a directory, download_it_p() generally gets the right answer,
> but when FOO is not a directory, download_it_p() sometimes false
> negatives.
That sounds like a restatement of what you said in the previous two
paragraphs :-). "False negative" seems like an odd term for an
intended and documented behavior; I mean, the behavior might not be
the best available, but there's nothing "false" about it behaving as
advertised.

Now, we *could* have a client-side behavior whereby the client can be
configured to fetch all files under a certain size, and then files
larger than that just magically don't get checked out until you
explicitly request one (or more) of them by name or by some other
unambiguous specification. Or it could be based on files that either
have or don't have a certain property, or mime-type, etc.

I think right now I prefer the --depth=directories behavior because
it's simpler to explain and understand, and then if people need
something fancier they can script the fancier thing.
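
For example, a size-threshold policy needs nothing beyond today's
commands plus the proposed skeleton checkout (illustrative parsing of
'svn list -Rv' output, so don't trust it with whitespace in paths):

    # Inside the skeleton working copy: fetch only files under 10 MB.
    # 'svn list -Rv' prints rev, author, size, date, path; directory
    # entries end in '/' and are skipped.
    svn list -Rv -- "$URL" \
        | awk '$NF !~ /\/$/ && $3 < 10485760 {print $NF}' \
        | xargs svn update --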
> Separately, this sounds like it shouldn't be too hard to prototype:
> e.g., something along these lines:
> .
>     svn checkout --depth=empty -- "$URL" foo
>     cd foo
>     svn update --parents --set-depth=empty -- \
>         $(LC_ALL=C svn info -R -- "$URL" |
>           grep-dctrl -F 'Node Kind' -ns Path directory | sort)
> .
> where grep-dctrl(1) is a generic "grep a list of rfc822 paragraphs"
> tool.
>
> I realize such prototypes wouldn't automatically deepen the worktree
> as new directories are added.
Or just actually script it in real life, building on the simple and
predictable --depth=directories.

I actually don't mean to defend the --depth=directories proposal *too*
enthusiastically here. I do agree that it's worth probing to see if
there is a better design available. But I'm conscious of the tradeoff
between use-case coverage and comprehensibility. The
--depth=directories plan has the advantage of being simple to explain
and predictable. More complicated behaviors impose a heavier mental
load. Any users who haven't invested the effort to understand them
would just be confused by Subversion's behavior ("Why are some of this
directory's files here but not all of them?").
> There's also svn-viewspec.py.
Ah -- that could be a good tool for prototyping with (I didn't know
about it). Also, svn-viewspec.py could be extended to support more
decision criteria: size, svn:mime-type, arbitrary properties, etc.
>> It's easy to see how these two features would work together to make
>> Subversion a quite good system for managing blobs ("binary large
>> objects"):
>> ⋮
>> * When someone needs a blob locally, they just check out (i.e.,
>>   update) that blob. There are various ways to do this, and it would
>>   even be easy to script new tools based on 'svn ls' that
>>   auto-complete the filenames or whatever.
> «svn ls» is a network operation, so autocomplete scripts might not
> like to use it due to latency.
'svn ls' does not *have* to always be a network operation, though. If
the client side has all the file metadata locally (names, sizes, etc.),
then autocompletion can be offered purely locally, based on a
no-network-usage option to 'svn ls' or some other means of accessing
that local information.
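
Even today one can fake it: fetch the listing once and complete
against the cached copy, with no per-keystroke network traffic. A
rough bash sketch (the cache file and the 'svn-fetch-blob' helper are
hypothetical names):

    svn ls -R -- "$URL" > ~/.blob-cache     # one network round trip

    _blob_complete() {
        # Complete purely locally, from the cached listing.
        COMPREPLY=($(compgen -W "$(cat ~/.blob-cache)" \
                             -- "${COMP_WORDS[COMP_CWORD]}"))
    }
    complete -F _blob_complete svn-fetch-blob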
>> When one is done with the file, one can keep it around or make it
>> disappear locally. (Right now making it go away requires some fancy
>> dance moves, but we could fix 'svn update --depth=empty FILENAME' to
>> Do The Right Thing, or we could add a new flag, or whatever.
> There already is «svn update --set-depth=exclude». An «svn cleanup»
> is required thereafter to vacuum the unused pristine
> (https://subversion.apache.org/docs/release-notes/1.7#wc-pristines).
A command that involves a semi-mandatory 'svn cleanup' afterwards is
probably not a great experience for users... But in any case, remember
that for the files we're talking about here, there is no pristine to
be vacuumed anyway.
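
For reference, the dance as it stands, with a stand-in file name
('--vacuum-pristines' is the 1.10+ spelling; older clients vacuum on a
plain 'svn cleanup'):

    svn update --set-depth=exclude -- big-file.zip   # drop the local copy
    svn cleanup --vacuum-pristines                   # reclaim its pristine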
>> Also, people would presumably write scripts to help with blob
>> management in SVN, and eventually some of those scripts would make
>> their way into our contrib/ area.)
> contrib/ is deprecated.
Ah, I forgot that, thanks. (I remember knowing it at one point, but...
my brain stays the same size while the quantity of memories grows;
there is an obvious flaw in this situation.)
>> * Subversion's existing path-based authorization can be used so that
>>   each person's sparse checkout has the directories it needs and
>>   doesn't have any subtrees that it shouldn't have.
> Authz is completely orthogonal to these feature requests; they
> involve no changes to authz implementation or configuration.
That is correct (hence the word "existing"). I was describing the user
experience holistically. For most of the real-world use cases, at
least the ones that I'm aware of, the path-based authz system is very
likely to be in use in these situations, and I wanted to flag that.
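
For the curious, that's plain existing authz configuration; the
repository, path, and group names below are invented for illustration:

    [groups]
    team-b = alice, bob

    # Everyone can read (and thus check out skeletons of) the tree...
    [blobs:/]
    * = r

    # ...but only team B ever sees its heavyweight artifacts.
    [blobs:/team-b/artifacts]
    * =
    @team-b = rw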
>> Neither of these two proposed changes is huge. Of the two, issue
>> #525 is bigger, and recently there is some interest in solving it (I
>> need to follow up with some other folks who have shown interest, and
>> I will post back here if it looks like we have a coalition). The
>> --depth change shouldn't be very hard at all, though please correct
>> me if I'm mistaken about that.
> Does it involve extending svn_depth_t with an svn_depth_directories
> value? That type is used all over the place, so there might be a
> non-negligible amount of code to review for correctness (and lack of
> asserts) in the face of such an extension.
I assume it does involve that, although I'm not thinking about the
code-level details of the change at this stage, only the user-visible
behavior.
> Also, I'm not sure whether new RA APIs would be required in order to
> implement the new behaviour performantly.
I'm not sure (but see above).
>> I wanted to circulate this to see if it sounds good to others, and
>> because people might suggest refinements -- or even suggest better
>> ideas entirely for managing blobs in Subversion.
> Increase the svndiff window size, so a single byte addition at the
> start of the file doesn't result in $filesize/100KB delta ops?
Maybe? I *think* that's a rare case, and if it is then it's probably
not worth the implementation complexity. I believe that when large
blobby files get changed they tend to get changed all over, even when
the semantic change is small (partly because their formats often have
built-in compression or encryption).
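
That's easy to spot-check with stock tools (the names are stand-ins;
any large text file will do):

    zip -q v1.zip big-input.txt
    sed -i '1s/^/x/' big-input.txt      # one-byte insertion at the top
    zip -q v2.zip big-input.txt
    cmp -l v1.zip v2.zip | wc -l        # count of differing byte offsets

For deflated data the one-byte shift cascades through the rest of the
compressed stream, so the count comes out as a large fraction of the
archive size, and a delta between the two versions saves almost
nothing.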
Thanks for the review!
Best regards,
-Karl