On 30 Jul 2021, Daniel Shahaf wrote:
Karl Fogel wrote on Tue, Jul 27, 2021 at 20:24:32 -0500:
1) Make pristine text-base files optional. See issue #525 for details. In summary: currently, every large file uses twice the storage on the client side, and yet for most of these files there's little benefit. They're usually not plaintext, so 'svn diff' against the pristine base is pointless (unless you have some specialized diff tool for the particular binary format, but that's rare),

Then how do people do pre- or post-commit reviews of their changes?

I think you're thinking of code, or small prose files, or something?

But I'm talking about 100GB zip files and other gigantic generated binary blobs. When working with objects like that, the way one reviews one's changes is usually not by diffing against the previous version -- or if one *is* going to do that, then one simply manually keeps a safe copy of the original version around until the commit is done.

(I answered the "pre-commit" part of your question. I don't understand the "post-commit" part -- even in the small-text-files case, the pristine base files don't help with post-commit review anyway.)

and 'svn commit' likewise just sends up the whole working file. The only thing a local base gets you is local 'svn revert', which can be nice, but many of us would happily give it up for large files to avoid the 2x local storage cost.

What about the ability to commit a change by uploading the delta as opposed to the new fulltext?

As issue #525 notes, with these kinds of files there is almost never any useful delta for that anyway. You end up shipping the entire new version across the wire on every commit. In fact, the client wastes time trying to find a useful diff, when in the end it's just going to have to send the fulltext (or the size equivalent of the fulltext). It would have been faster just to call sendfile() or something like that.
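The futility of computing deltas against compressed data can be seen with a toy illustration (zlib here stands in for the compressed file format; this is not Subversion's actual xdelta code): make one small edit in the middle of a file, and the compressed representations of the two versions share almost nothing, even though the plaintexts are nearly identical.

```python
import zlib

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Two versions of a file differing by one small edit in the middle.
old = b"".join(b"record %04d: payload payload payload\n" % i for i in range(1000))
new = old.replace(b"record 0500", b"record xxxx")

plain_cp = common_prefix_len(old, new)   # large: the first ~500 records are identical
comp_cp = common_prefix_len(zlib.compress(old), zlib.compress(new))

print("uncompressed shared prefix:", plain_cp, "bytes")
print("compressed shared prefix:  ", comp_cp, "bytes")   # far smaller
```

A delta algorithm working on the compressed bytes has almost no common material to reuse, so it ends up shipping roughly the fulltext anyway.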

Really, for these kinds of files, the only thing local pristine base files provide is local revert (at the cost of 2x storage).

Note that this is a purely client-side change, controlled entirely by client-side configuration. Different people can thus have different thresholds, depending on how much local disk space they have. A server would never even know if a client is or isn't saving text-bases.

What would «svn status» of a modified file without a pristine say?
How many network/worktree accesses would it involve?

Status would say "modified". The client side still knows the fingerprint (hash) of the pristine original, naturally.
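In principle, a pristine-less status check needs only the recorded checksum plus one pass over the working file -- no network, no pristine fulltext. A minimal sketch (the function names are illustrative, not Subversion's actual API):

```python
import hashlib
import os
import tempfile

def file_checksum(path):
    """Stream-hash the working file in chunks; no pristine fulltext is read."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def local_status(path, recorded_checksum):
    """Report 'modified' vs. 'unmodified' purely from local data."""
    return "unmodified" if file_checksum(path) == recorded_checksum else "modified"

# Demo: record the checksum "at checkout", then make a local edit.
fd, path = tempfile.mkstemp()
os.write(fd, b"original contents")
os.close(fd)
recorded = file_checksum(path)      # what the wc metadata would have stored

with open(path, "ab") as f:
    f.write(b" plus a local edit")
print(local_status(path, recorded))   # modified
os.remove(path)
```

So the only cost of status without a pristine is re-reading the working file, which the client has to do to detect changes anyway (when timestamps don't settle the question).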

Would it be possible to convert a file back and forth between having and not having a pristine?

Sure, though it might involve the network, unless you haven't modified the local file. I doubt that in practice this would be done very often (but I could be wrong, who knows).

Suppose the user reverts the file without using «svn revert». Would the file show up as modified? Would a commit cause a null change to the file (new noderev with fulltext and props both identical to the predecessor noderev's)?

See above about fingerprint.

How about (hard-|sym)linking the worktree file to the pristine and making it read-only until the user requests it to be made editable? Compare git-annex-unlock(1).

That's an interesting idea. I'm not sure how portable hard-links are, and we can't rely on every file-modifying tool on the user's system knowing about this.

I don't know, hmm. It's clever, but it also feels shaky and unreliable to me. And even then, when I run 'svn edit' or whatever to declare the file editable and make it read-write, I usually *still* don't want 2x the storage! :-)

And supposing I am okay with the 2x storage for that file, there's still the time cost: for a large file, copying the working file from the pristine could take a while.

One can get the same effect by just making a copy off to the side manually. In a sense, you're proposing that this "keep a safe copy manually" behavior be built in to Subversion in the form of the "make editable" command. But I'm not sure the extra complexity in Subversion is worth it when such an easy workaround is available.
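The manual workaround amounts to a few lines. Here is a sketch of what such a wrapper would do (entirely hypothetical -- not an svn command -- using plain file copies in place of a pristine):

```python
import os
import shutil

def edit_with_safety_copy(path, edit):
    """Keep a side copy as a do-it-yourself pristine: apply the edit, and
    restore the original if it fails (a manual, local 'revert')."""
    backup = path + ".orig"            # illustrative naming convention
    shutil.copy2(path, backup)
    try:
        edit(path)                     # make the working change
        # ...review the change, run `svn commit`, etc....
    except Exception:
        shutil.copy2(backup, path)     # manual equivalent of `svn revert`
        raise
    finally:
        os.remove(backup)
```

The point is just that the user controls when (and for which files) the 2x storage cost is paid, instead of paying it unconditionally for every file.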

But the idea is still attractive.  I'm curious what others think.

There was also a request to store pristines compressed, but I don't know whether there's still demand for that.

Very often these kinds of files are already compressed, and so can't be usefully compressed further.

In fact, in the actual use case I deal with most often, they are literally gigantic compressed and encrypted .zip files, with each one containing a bunch of PDFs and/or CSV files.

2) Add a new '--depth=directories' depth type to make it easy to check out a sparse tree, that is, a skeleton directory tree without the files. Then, within a given directory, you can do 'svn update --depth=files' or check out a particular file by name as needed. There's no ticket associated with this feature, as far as I know, but I can file one after this post if people think this idea is worthwhile.

Hmm.

Taking the FreeBSD ports tree (https://svnweb.freebsd.org/ports/head/) as an example, the obvious next feature request would be to also fetch the pkg-descr files from each port directory, even as new ports are added, in order to facilitate a local search of port descriptions (via «make search» in ports(7)).

Taking ASF's dist/release/ tree as an example, it might be useful to automatically retrieve only READMEs and detached signatures, but not the artifacts themselves.

Yes, those are plausible use cases. Behaviors like that are easily scriptable with the feature I've described.

In general, I suspect «svn_boolean_t download_it_p(dirent *foo) { return foo->kind == svn_node_directory; }» is only half right: when FOO is a directory, download_it_p() generally gets the right answer, but when FOO is not a directory, download_it_p() sometimes false negatives.

That sounds like a restatement of what you said in the previous two paragraphs :-). "False negative" seems like an odd term for an intended and documented behavior; I mean, the behavior might not be the best available, but there's nothing "false" about it behaving as advertised.

Now, we *could* have a client-side behavior whereby the client can be configured to fetch all files under a certain size, and then files larger than that just magically don't get checked out until you explicitly request one (or more) of them by name or by some other unambiguous specification. Or it could be based on files that either have or don't have a certain property, or mime-type, etc.
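Such a fancier client-side policy might look roughly like this (every configuration name and default below is invented for illustration -- nothing like it exists in Subversion today):

```python
# Hypothetical client-side policy; all names and defaults are illustrative.
PRISTINE_MAX_SIZE = 64 * 1024 * 1024   # keep pristines for files under 64 MiB
PRISTINE_SKIP_MIME = {"application/zip", "application/octet-stream"}

def keep_pristine(size_bytes, mime_type=None, pinned=False):
    """Would this client store a pristine for the file?

    `pinned` stands in for some per-file override (e.g. a property a user
    could set to force pristine storage regardless of size)."""
    if pinned:
        return True
    if mime_type in PRISTINE_SKIP_MIME:
        return False
    return size_bytes <= PRISTINE_MAX_SIZE

print(keep_pristine(3 * 1024))                                       # True
print(keep_pristine(100 * 1024**3, "application/zip"))               # False
print(keep_pristine(100 * 1024**3, "application/zip", pinned=True))  # True
```

Each branch of that function is one more thing a user has to understand before the working copy's behavior stops being surprising, which is the comprehensibility cost I mention below.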

I think right now I prefer the --depth=directories behavior because it's simpler to explain and understand, and then if people need something fancier they can script the fancier thing.

Separately, this sounds like it shouldn't be too hard to prototype:
e.g., something along these lines:
.
   svn checkout --depth=empty -- "$URL" foo
   cd foo
   svn update --parents --set-depth=empty -- \
       $(LC_ALL=C svn info -R -- "$URL" |
         grep-dctrl -F 'Node Kind' -ns Path directory |
         sort)
.
where grep-dctrl(1) is a generic "grep a list of rfc822 paragraphs" tool. I realize such prototypes wouldn't automatically deepen the worktree as new directories are added.

Or just actually script it in real life, building on the simple and predictable --depth=directories.

I actually don't mean to defend the --depth=directories proposal *too* enthusiastically here. I do agree that it's worth probing to see if there is a better design available. But I'm conscious of the tradeoff between use-case coverage and comprehensibility. The --depth=directories plan has the advantage of being simple to explain and predictable. More complicated behaviors impose a heavier mental load, and users who haven't invested the effort to understand them would just be confused by Subversion's behavior ("Why are some of this directory's files here but not all of them?").

There's also svn-viewspec.py.

Ah -- that could be a good tool for prototyping with (I didn't know about it). Also, svn-viewspec.py could be extended to support more decision criteria: size, svn:mime-type, arbitrary properties, etc.

It's easy to see how these two features would work together to make Subversion a quite good system for managing blobs ("binary large objects"):
* When someone needs a blob locally, they just check out (i.e., update) that blob. There are various ways to do this, and it would even be easy to script new tools based on 'svn ls' that auto-complete the filenames or whatever.

«svn ls» is a network operation, so autocomplete scripts might not like to use it due to latency.

'svn ls' does not *have* to always be a network operation though. If the client side has all the file metadata locally (names, sizes, etc), then autocompletion can be offered purely locally, based on a no-network-usage option to 'svn ls' or some other means of accessing that local information.
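The completion side of that is trivial once the entry names are available locally (the cache shown here is invented for illustration; in reality the names would come from the working copy's metadata store, e.g. wc.db):

```python
def complete(prefix, cached_names):
    """Tab-completion candidates from working-copy metadata alone --
    no server round trip, so shell completion stays snappy."""
    return sorted(n for n in cached_names if n.startswith(prefix))

# Illustrative stand-in for names read from local working-copy metadata:
cached = ["report-2021-Q1.zip", "report-2021-Q2.zip", "notes.txt"]
print(complete("report", cached))   # ['report-2021-Q1.zip', 'report-2021-Q2.zip']
```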

When one is done with the file, one can keep it around or make it disappear locally. (Right now making it go away requires some fancy dance moves, but we could fix 'svn update --depth=empty FILENAME' to Do The Right Thing, or we could add a new flag, or whatever.

There already is «svn update --set-depth=exclude». An «svn cleanup» is required thereafter to vacuum the unused pristine
(https://subversion.apache.org/docs/release-notes/1.7#wc-pristines).

A command that involves a semi-mandatory 'svn cleanup' afterwards is probably not a great experience for users... But in any case, remember that for the files we're talking about here, there is no pristine to be vacuumed anyway.

Also, people would presumably write scripts to help with blob management in SVN, and eventually some of those scripts would make their way into our contrib/ area.)

contrib/ is deprecated.

Ah, I forgot that, thanks. (I remember knowing it at one point, but... my brain stays the same size while the quantity of memories grows; there is an obvious flaw in this situation.)

* Subversion's existing path-based authorization can be used so that each person's sparse checkout has the directories it needs and doesn't have any subtrees that it shouldn't have.

Authz is completely orthogonal to these feature requests; they involve no changes to authz implementation or configuration.

That is correct (hence the word "existing"). I was describing the user experience holistically. For most of the real-world use cases, at least the ones that I'm aware of, the path-based authz system is very likely to be in use in these situations, and I wanted to flag that.

Neither of these two proposed changes is huge. Of the two, issue #525 is bigger, and recently there has been some interest in solving it (I need to follow up with some other folks who have shown interest, and I will post back here if it looks like we have a coalition). The --depth change shouldn't be very hard at all, though please correct me if I'm mistaken about that.

Does it involve extending svn_depth_t with an svn_depth_directories value? That type is used all over the place, so there might be a non-negligible amount of code to review for correctness (and lack of asserts) in the face of such an extension.

I assume it does involve that, although I'm not thinking about the code-level details of the change at this stage, only the user-visible behavior.

Also, I'm not sure whether new RA APIs would be required in order to implement the new behaviour performantly.

I'm not sure (but see above).

I wanted to circulate this to see if it sounds good to others, and because people might suggest refinements -- or even suggest better ideas entirely for managing blobs in Subversion.

Increase the svndiff window size, so a single byte addition at the start of the file doesn't result in $filesize/100KB delta ops?

Maybe? I *think* that's a rare case, and if it is then it's probably not worth the implementation complexity. I believe that when large blobby files get changed they tend to get changed all over, even when the semantic change is small (partly because their formats often have built-in compression or encryption).

Thanks for the review!

Best regards,
-Karl
