On 30 Jul 2021, Daniel Shahaf wrote:
Karl Fogel wrote on Tue, Jul 27, 2021 at 20:24:32 -0500:
1) Make pristine text-base files optional. See issue #525 for details. In summary: currently, every large file uses twice the storage on the client side, and yet for most of these files there's little benefit. They're usually not plaintext, so 'svn diff' against the pristine base is pointless (unless you have some specialized diff tool for the particular binary format, but that's rare),

Then how do people do pre- or post-commit reviews of their changes?

I think you're thinking of code, or small prose files, or something?

But I'm talking about 100GB zip files and other gigantic generated binary blobs. When working with objects like that, the way one reviews one's changes is usually not by diffing against the previous version -- or if one *is* going to do that, then one simply manually keeps a safe copy of the original version around until the commit is done.

(I answered the "pre-commit" part of your question. I don't understand the "post-commit" part -- even in the small-text-files case, the pristine base files don't help with post-commit review anyway.)

and 'svn commit' likewise just sends up the whole working file. The only thing a local base gets you is local 'svn revert', which can be nice, but many of us would happily give it up for large files to avoid the 2x local storage cost.

What about the ability to commit a change by uploading the delta as opposed to the new fulltext?

As issue #525 notes, with these kinds of files there is almost never any useful delta for that anyway. You end up shipping the entire new version across the wire on every commit. In fact, the client wastes time trying to find a useful diff, when in the end it's just going to have to send the fulltext (or the size equivalent of the fulltext). It would have been faster just to call sendfile() or something like that.
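The futility of computing deltas against compressed data can be seen with a toy illustration (zlib here stands in for the compressed file format; this is not Subversion's actual xdelta code): make one small edit in the middle of a file, and the compressed representations of the two versions share almost nothing, even though the plaintexts are nearly identical.

```python
import zlib

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Two versions of a file differing by one small edit in the middle.
old = b"".join(b"record %04d: payload payload payload\n" % i for i in range(1000))
new = old.replace(b"record 0500", b"record xxxx")

plain_cp = common_prefix_len(old, new)   # large: the first ~500 records are identical
comp_cp = common_prefix_len(zlib.compress(old), zlib.compress(new))

print("uncompressed shared prefix:", plain_cp, "bytes")
print("compressed shared prefix:  ", comp_cp, "bytes")   # far smaller
```

A delta algorithm working on the compressed bytes has almost no common material to reuse, so it ends up shipping roughly the fulltext anyway.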

Really, for these kinds of files, the only thing local pristine base files provide is local revert (at the cost of 2x storage).

Note that this is a purely client-side change, controlled entirely by client-side configuration. Different people can thus have different thresholds, depending on how much local disk space they have. A server would never even know if a client is or isn't saving text-bases.

What would «svn status» of a modified file without a pristine say?
How many network/worktree accesses would it involve?

Status would say "modified". The client side still knows the fingerprint (hash) of the pristine original, naturally.
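In principle, a pristine-less status check needs only the recorded checksum plus one pass over the working file -- no network, no pristine fulltext. A minimal sketch (the function names are illustrative, not Subversion's actual API):

```python
import hashlib
import os
import tempfile

def file_checksum(path):
    """Stream-hash the working file in chunks; no pristine fulltext is read."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def local_status(path, recorded_checksum):
    """Report 'modified' vs. 'unmodified' purely from local data."""
    return "unmodified" if file_checksum(path) == recorded_checksum else "modified"

# Demo: record the checksum "at checkout", then make a local edit.
fd, path = tempfile.mkstemp()
os.write(fd, b"original contents")
os.close(fd)
recorded = file_checksum(path)      # what the wc metadata would have stored

with open(path, "ab") as f:
    f.write(b" plus a local edit")
print(local_status(path, recorded))   # modified
os.remove(path)
```

So the only cost of status without a pristine is re-reading the working file, which the client has to do to detect changes anyway (when timestamps don't settle the question).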

Would it be possible to convert a file back and forth between having and not having a pristine?

Sure, though it might involve the network, unless you haven't modified the local file. I doubt that in practice this would be done very often (but I could be wrong, who knows).

Suppose the user reverts the file without using «svn revert». Would the file show up as modified? Would a commit cause a null change to the file (new noderev with fulltext and props both identical to the predecessor noderev's)?

See above about fingerprint.

How about (hard-|sym)linking the worktree file to the pristine and making it read-only until the user requests it to be made editable? Compare git-annex-unlock(1).

That's an interesting idea. I'm not sure how portable hard-links are, and we can't rely on every file-modifying tool on the user's system knowing about this.

I don't know, hmm. It's clever, but it also feels shaky and unreliable to me. And even then, when I run 'svn edit' or whatever to declare the file editable and make it read-write, I usually *still* don't want 2x the storage! :-)

And supposing I am okay with the 2x storage for that file, there's still the time cost: for a large file, copying the working file from the pristine could take a while.

One can get the same effect by just making a copy off to the side manually. In a sense, you're proposing that this "keep a safe copy manually" behavior be built in to Subversion in the form of the "make editable" command. But I'm not sure the extra complexity in Subversion is worth it when such an easy workaround is available.
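The manual workaround amounts to a few lines. Here is a sketch of what such a wrapper would do (entirely hypothetical -- not an svn command -- using plain file copies in place of a pristine):

```python
import os
import shutil

def edit_with_safety_copy(path, edit):
    """Keep a side copy as a do-it-yourself pristine: apply the edit, and
    restore the original if it fails (a manual, local 'revert')."""
    backup = path + ".orig"            # illustrative naming convention
    shutil.copy2(path, backup)
    try:
        edit(path)                     # make the working change
        # ...review the change, run `svn commit`, etc....
    except Exception:
        shutil.copy2(backup, path)     # manual equivalent of `svn revert`
        raise
    finally:
        os.remove(backup)
```

The point is just that the user controls when (and for which files) the 2x storage cost is paid, instead of paying it unconditionally for every file.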

But the idea is still attractive.  I'm curious what others think.

There was also a request to store pristines compressed, but I don't know whether there's still demand for that.

Very often these kinds of files are already compressed, and so can't be usefully compressed further.

In fact, in the actual use case I deal with most often, they are literally gigantic compressed and encrypted .zip files, with each one containing a bunch of PDFs and/or CSV files.

2) Add a new '--depth=directories' depth type to make it easy to check out a sparse tree, that is, a skeleton directory tree without the files. Then, within a given directory, you can do 'svn update --depth=files' or check out a particular file by name as needed. There's no ticket associated with this feature, as far as I know, but I can file one after this post if people think this idea is worthwhile.

Hmm.

Taking the FreeBSD ports tree (https://svnweb.freebsd.org/ports/head/) as an example, the obvious next feature request would be to also fetch the pkg-descr files from each port directory, even as new ports are added, in order to facilitate a local search of port descriptions (via «make search» in ports(7)).

Taking ASF's dist/release/ tree as an example, it might be useful to automatically retrieve only READMEs and detached signatures, but not the artifacts themselves.

Yes, those are plausible use cases. Behaviors like that are easily scriptable with the feature I've described.

In general, I suspect «svn_boolean_t download_it_p(dirent *foo) { return foo->kind == svn_node_directory; }» is only half right: when FOO is a directory, download_it_p() generally gets the right answer, but when FOO is not a directory, download_it_p() sometimes false negatives.

That sounds like a restatement of what you said in the previous two paragraphs :-). "False negative" seems like an odd term for an intended and documented behavior; I mean, the behavior might not be the best available, but there's nothing "false" about it behaving as advertised.

Now, we *could* have a client-side behavior whereby the client can be configured to fetch all files under a certain size, and then files larger than that just magically don't get checked out until you explicitly request one (or more) of them by name or by some other unambiguous specification. Or it could be based on files that either have or don't have a certain property, or mime-type, etc.
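Such a fancier client-side policy might look roughly like this (every configuration name and default below is invented for illustration -- nothing like it exists in Subversion today):

```python
# Hypothetical client-side policy; all names and defaults are illustrative.
PRISTINE_MAX_SIZE = 64 * 1024 * 1024   # keep pristines for files under 64 MiB
PRISTINE_SKIP_MIME = {"application/zip", "application/octet-stream"}

def keep_pristine(size_bytes, mime_type=None, pinned=False):
    """Would this client store a pristine for the file?

    `pinned` stands in for some per-file override (e.g. a property a user
    could set to force pristine storage regardless of size)."""
    if pinned:
        return True
    if mime_type in PRISTINE_SKIP_MIME:
        return False
    return size_bytes <= PRISTINE_MAX_SIZE

print(keep_pristine(3 * 1024))                                       # True
print(keep_pristine(100 * 1024**3, "application/zip"))               # False
print(keep_pristine(100 * 1024**3, "application/zip", pinned=True))  # True
```

Each branch of that function is one more thing a user has to understand before the working copy's behavior stops being surprising, which is the comprehensibility cost I mention below.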

I think right now I prefer the --depth=directories behavior because it's simpler to explain and understand, and then if people need something fancier they can script the fancier thing.

Separately, this sounds like it shouldn't be too hard to prototype:
e.g., something along these lines:
.
   svn checkout --depth=empty -- "$URL" foo
   cd foo
   svn update --parents --set-depth=empty -- \
       $(LC_ALL=C svn info -R -- "$URL" |
         grep-dctrl -F 'Node Kind' -ns Path directory |
         sort)
.
where grep-dctrl(1) is a generic "grep a list of rfc822 paragraphs" tool. I realize such prototypes wouldn't automatically deepen the worktree as new directories are added.

Or just actually script it in real life, building on the simple and predictable --depth=directories.

I actually don't mean to defend the --depth=directories proposal *too* enthusiastically here. I do agree that it's worth probing to see if there is a better design available. But I'm conscious of the tradeoff between use-case coverage and comprehensibility. The --depth=directories plan has the advantage of being simple to explain and predictable. More complicated behaviors impose a heavier mental load, and users who haven't invested the effort to understand them would just be confused by Subversion's behavior ("Why are some of this directory's files here but not all of them?").

There's also svn-viewspec.py.

Ah -- that could be a good tool for prototyping with (I didn't know about it). Also, svn-viewspec.py could be extended to support more decision criteria: size, svn:mime-type, arbitrary properties, etc.

It's easy to see how these two features would work together to make Subversion a quite good system for managing blobs ("binary large objects"):
* When someone needs a blob locally, they just check out (i.e., update) that blob. There are various ways to do this, and it would even be easy to script new tools based on 'svn ls' that auto-complete the filenames or whatever.

«svn ls» is a network operation, so autocomplete scripts might not like to use it due to latency.

'svn ls' does not *have* to always be a network operation though. If the client side has all the file metadata locally (names, sizes, etc), then autocompletion can be offered purely locally, based on a no-network-usage option to 'svn ls' or some other means of accessing that local information.
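The completion side of that is trivial once the entry names are available locally (the cache shown here is invented for illustration; in reality the names would come from the working copy's metadata store, e.g. wc.db):

```python
def complete(prefix, cached_names):
    """Tab-completion candidates from working-copy metadata alone --
    no server round trip, so shell completion stays snappy."""
    return sorted(n for n in cached_names if n.startswith(prefix))

# Illustrative stand-in for names read from local working-copy metadata:
cached = ["report-2021-Q1.zip", "report-2021-Q2.zip", "notes.txt"]
print(complete("report", cached))   # ['report-2021-Q1.zip', 'report-2021-Q2.zip']
```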

When one is done with the file, one can keep it around or make it disappear locally. (Right now making it go away requires some fancy dance moves, but we could fix 'svn update --depth=empty FILENAME' to Do The Right Thing, or we could add a new flag, or whatever.

There already is «svn update --set-depth=exclude». An «svn cleanup» is required thereafter to vacuum the unused pristine
(https://subversion.apache.org/docs/release-notes/1.7#wc-pristines).

A command that involves a semi-mandatory 'svn cleanup' afterwards is probably not a great experience for users... But in any case, remember that for the files we're talking about here, there is no pristine to be vacuumed anyway.

Also, people would presumably write scripts to help with blob management in SVN, and eventually some of those scripts would make their way into our contrib/ area.)

contrib/ is deprecated.

Ah, I forgot that, thanks. (I remember knowing it at one point, but... my brain stays the same size while the quantity of memories grows; there is an obvious flaw in this situation.)

* Subversion's existing path-based authorization can be used so that each person's sparse checkout has the directories it needs and doesn't have any subtrees that it shouldn't have.

Authz is completely orthogonal to these feature requests; they involve no changes to authz implementation or configuration.

That is correct (hence the word "existing"). I was describing the user experience holistically. For most of the real-world use cases, at least the ones that I'm aware of, the path-based authz system is very likely to be in use in these situations, and I wanted to flag that.

Neither of these two proposed changes is huge. Of the two, issue #525 is bigger, and recently there has been some interest in solving it (I need to follow up with some other folks who have shown interest, and I will post back here if it looks like we have a coalition). The --depth change shouldn't be very hard at all, though please correct me if I'm mistaken about that.

Does it involve extending svn_depth_t with an svn_depth_directories value? That type is used all over the place, so there might be a non-negligible amount of code to review for correctness (and lack of asserts) in the face of such an extension.

I assume it does involve that, although I'm not thinking about the code-level details of the change at this stage, only the user-visible behavior.

Also, I'm not sure whether new RA APIs would be required in order to implement the new behaviour performantly.

I'm not sure (but see above).

I wanted to circulate this to see if it sounds good to others, and because people might suggest refinements -- or even suggest better ideas entirely for managing blobs in Subversion.

Increase the svndiff window size, so a single byte addition at the start of the file doesn't result in $filesize/100KB delta ops?

Maybe? I *think* that's a rare case, and if it is then it's probably not worth the implementation complexity. I believe that when large blobby files get changed they tend to get changed all over, even when the semantic change is small (partly because their formats often have built-in compression or encryption).

Thanks for the review!

Best regards,
-Karl
