Re: A two-part vision for Subversion and large binary objects.

Daniel Shahaf Sun, 01 Aug 2021 20:55:35 -0700

Karl Fogel wrote on Fri, Jul 30, 2021 at 22:42:25 -0500:
> On 30 Jul 2021, Daniel Shahaf wrote:
> > Karl Fogel wrote on Tue, Jul 27, 2021 at 20:24:32 -0500:
> > > 1) Make pristine text-base files optional.  See issue #525 for
> > > details.  In summary: currently, every large file uses twice the
> > > storage on the client side, and yet for most of these files there's
> > > little benefit.  They're usually not plaintext, so 'svn diff'
> > > against the pristine base is pointless (unless you have some
> > > specialized diff tool for the particular binary format, but that's
> > > rare),
> > 
> > Then how do people do pre- or post-commit reviews of their changes?
> 
> I think you're thinking of code, or small prose files, or something?


No, I was thinking of formats that aren't amenable to unidiffing (e.g.,
png, a.out).  Let's continue with your concrete example of *.pdf.tar.gz.enc.

> But I'm talking about 100GB zip files and other gigantic generated binary
> blobs.  When working with objects like that, the way one reviews one's
> changes is usually not by diffing against the previous version --

Why?  That's exactly what I'm getting at: whether there _aren't_ diff
tools, or whether they exist but aren't used in the same way as in a
coding workflow.  In the latter case, it might be a better overall
solution to, say, extend, optimize, or advertise the diff-cmd feature.

And yes, I'm asking this question even if it's a gigantic, encrypted,
compressed zip file of PDFs.  There's nothing stopping people from
writing a diff tool that decrypts and does a 'diff -r' of the
pdftotext(1) of the contents.  (See, e.g., diffoscope(1).)

> or if one *is* going to do that, then one simply manually keeps a safe
> copy of the original version around until the commit is done.

What do you mean by "manually"?  It sounds like you're saying people
want to disable pristines and then to manually keep pristines
around(??).

When preparing a commit that changes the binary file, is the @BASE
version of the file read(2)ed at any point in the process of generating
the new version?  If not, perhaps an 'svn import' or svnmucc workflow
would suit those files better?  (I'd probably recommend svnmucc if only
because it has the -r argument to detect out-of-date errors with.  'svn
import URL' doesn't have this, does it?)

> (I answered the "pre-commit" part of your question.  I don't understand the
> "post-commit" part -- even in the small-text-files case, the pristine base
> files don't help with post-commit review anyway.)
> 

I was just trying to say that there presumably is _some_ way to review
changes to these files, so whatever it is, it could perhaps be hooked
into «svn diff», making pristines useful.

> > > and 'svn commit' likewise just sends up the whole working file. The
> > > only thing a local base gets you is local 'svn revert', which can be
> > > nice, but many of us would happily give it up for large files to
> > > avoid the 2x local storage cost.
> > 
> > What about the ability to commit a change by uploading the delta as
> > opposed to the new fulltext?
> 
> As issue #525 notes, with these kinds of files there is almost never any
> useful delta for that anyway.  You end up shipping the entire new version
> across the wire on every commit.  In fact, the client wastes time trying to
> find a useful diff, when in the end it's just going to have to send the
> fulltext (or the size equivalent of the fulltext).  It would have been
> faster just to call sendfile() or something like that.

That sounds like a separate enhancement.

> > > Note that this is a purely client-side change, controlled entirely
> > > by client-side configuration.  Different people can thus have
> > > different thresholds, depending on how much local disk space they
> > > have. A server would never even know if a client is or isn't saving
> > > text-bases.
> > 
> > What would «svn status» of a modified file without a pristine say?
> > How many network/worktree accesses would it involve?
> 
> Status would say "modified".  The client-side still knows the fingerprint
> (hash) of the pristine original, naturally.

Okay.  What about hash collisions?  Presumably those won't be handled.
(Not disagreeing; just clarifying)
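
To make the fingerprint check concrete, here is a minimal sketch of how
a pristine-less file could be classified for status purposes.  It is
not Subversion's actual code: the function names are made up, it
assumes the recorded fingerprint is a SHA-1 of the file contents, and
it ignores keyword and EOL translation entirely.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(path, recorded_sha1):
    """Hypothetical status check: compare the working file's hash with
    the fingerprint recorded for the (absent) pristine."""
    return sha1_of_file(path) != recorded_sha1
```

Under this sketch, a file that was edited and then restored by hand
hashes to the recorded value again and so is reported as unmodified.
The cost is one full read of the working file whenever a cheaper check
cannot settle the question.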

Pristines are generally diffed against the _detranslated_ working file.
With that in mind, how would svn:keywords and svn:eol-style be handled
on pristine-less files?  (I realize that's not your use-case.)

> > Would it be possible to convert a file back and forth between
> > having and not having a pristine?
> 
> Sure, though it might involve the network, unless you haven't modified the
> local file.  I doubt that in practice that would be done very often (but I
> could be wrong about that, who knows).
> 
> > Suppose the user reverts the file without using «svn revert». Would the
> > file show up as modified?  Would a commit cause a null change to the
> > file (new noderev with fulltext and props both identical to the
> > predecessor noderev's)?
> 
> See above about fingerprint.
> 
> > How about (hard-|sym)linking the worktree file to the pristine and
> > making it read-only until the user requests it to be made editable?
> > Compare git-annex-unlock(1).
> 
> That's an interesting idea.  I'm not sure how portable hard-links are, and
> we can't rely on every file-modifying tool on the user's system knowing
> about this.

Isn't that a good thing?  In this situation, changing the pristine would
be undesirable; and tools that break links would, if run on a worktree
file, change the worktree file but not the pristine.  The problem would
be the converse case: tools that _don't_ break links, i.e., that rewrite
the inode's contents or that replace the (contents of the) file that
a symlink targets, would break the pristine files.  Pristine files are
chmod'd read-only, though, aren't they?
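
The two cases can be demonstrated with a small sketch.  It assumes
a filesystem that supports hard links; the file names and both helper
functions are invented for illustration and have nothing to do with how
Subversion lays out its pristine store.

```python
import os
import tempfile


def _make_linked_pair(directory):
    """Create a 'pristine' file and hard-link a 'working' file to it."""
    pristine = os.path.join(directory, "pristine")
    working = os.path.join(directory, "working")
    with open(pristine, "wb") as f:
        f.write(b"original")
    os.link(pristine, working)  # both names now refer to one inode
    return pristine, working


def pristine_after_in_place_edit():
    """A tool that rewrites the inode's contents changes the pristine too."""
    with tempfile.TemporaryDirectory() as d:
        pristine, working = _make_linked_pair(d)
        with open(working, "wb") as f:  # truncates and rewrites the same inode
            f.write(b"edited in place")
        with open(pristine, "rb") as f:
            return f.read()


def pristine_after_replace_edit():
    """A tool that writes a new file and renames it over the old name
    breaks the link, so the pristine keeps its contents."""
    with tempfile.TemporaryDirectory() as d:
        pristine, working = _make_linked_pair(d)
        scratch = working + ".tmp"
        with open(scratch, "wb") as f:
            f.write(b"edited via rename")
        os.replace(scratch, working)  # 'working' now names a new inode
        with open(pristine, "rb") as f:
            return f.read()
```

The first function returns the edited bytes and the second returns the
original ones.  If the shared inode were read-only, the in-place open
would normally fail for an unprivileged user, which is the protection
I was asking about.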

> I don't know, hmm.  It's clever, but it also feels shaky and unreliable to
> me.  And even then, when I run 'svn edit' or whatever to declare the file
> editable and make it read-write, I usually *still* don't want 2x the
> storage!  :-)
> 

Perhaps pristineless files should be read-only (like svn:needs-lock
files are) until an 'svn edit' is run?

> And supposing I am okay with the 2x storage for that file, there's still the
> time cost: for a large file, copying the working file from the pristine
> could take a while.
> 

First, why would you copy the file from the pristine store?  Second,
I don't disagree that a copy could "take a while", but how long would it
take compared to writing to disk the edited version of the file, prior
to committing it?  Would unlink(2)ing the pristine file (either before
or after copying or regenerating the file) help?

> > > 2) Add a new '--depth=directories' depth type to make it easy to
> > > check out a sparse tree, that is, a skeleton directory tree without
> > > the files.  Then, within a given directory, you can do
> > > 'svn update --depth=files' or check out a particular file by name
> > > as needed.  There's no ticket associated with this feature, as far
> > > as I know, but I can file one after this post if people think this
> > > idea is worthwhile.
> > 
> > Hmm.
> > 
> > Taking the FreeBSD ports tree (https://svnweb.freebsd.org/ports/head/)
> > as an example, the obvious next feature request would be to also fetch
> > the pkg-descr files from each port directory, even as new ports are
> > added, in order to facilitate a local search of port descriptions (via
> > «make search» in ports(7)).
> > 
> > Taking ASF's dist/release/ tree as an example, it might be useful to
> > automatically retrieve only READMEs and detached signatures, but not the
> > artifacts themselves.
> 
> Yes, those are plausible use cases.  Behaviors like that are easily
> scriptable with the feature I've described.
> 

The behaviour you propose is also easily scriptable, as I demonstrated
in my previous reply.  The question is just where to draw the line
between "easily scriptable" and "built in".  (And it's not a binary
question; there are middle grounds, such as tools/ and whatever replaces
contrib/.)

> > In general, I suspect «svn_boolean_t download_it_p(dirent *foo) { return
> > foo->kind == svn_node_directory; }» is only half right: when FOO is a
> > directory, download_it_p() generally gets the right answer, but when FOO
> > is not a directory, download_it_p() sometimes returns a false negative.
> 
> That sounds like a restatement of what you said in the previous two
> paragraphs :-).  "False negative" seems like an odd term for an intended and
> documented behavior; I mean, the behavior might not be the best available,
> but there's nothing "false" about it behaving as advertised.
> 
> Now, we *could* have a client-side behavior whereby the client can be
> configured to fetch all files under a certain size, and then files larger
> than that just magically don't get checked out until you explicitly request
> one (or more) of them by name or by some other unambiguous specification.
> Or it could be based on files that either have or don't have a certain
> property, or mime-type, etc.
> 
> I think right now I prefer the --depth=directories behavior because it's
> simpler to explain and understand, and then if people need something fancier
> they can script the fancier thing.
> 

--depth=directories maps well into the existing depth semantics, but
I am not sure it maps as well into use-cases.  Nobody walked into users@
requesting a "Fetch the directory tree skeleton" feature.  The request
was to not keep pristines for some files.  That sounds like an attribute
of the file in a particular working copy.  So, I'd think more in the
direction of:

1. Having some per-working-copy state.

2. Allowing that state to be used to set/unset the "Don't keep
a pristine" or "Don't checkout" bit.

3. Allowing the repository to offer default values for that state.

For instance, (1) could be (for the sake of example) the svn:thiswc:*
property namespace; we'd arrange for properties in that namespace to be
excluded from status/commit; we'd have an svn:thiswc:no-pristine
property that's used to determine which files won't have a pristine, and
an svn:settings-proposed-by-the-repository-administrator:no-pristine
property, versioned in the usual way, that 'svn checkout' would seed
svn:thiswc:no-pristine with (unless the user opts out of this).  It's
basically the https://subversion.apache.org/faq#ignore-commit pattern
but with properties (and property namespaces, à la
tools/hook-scripts/persist-ephemeral-txnprops.py).

svn:thiswc:no-pristine could comprise a list of glob patterns à la
svn:ignore, a list of DSL expressions à la hg revsets, or anything in
between.
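
As a rough illustration of the glob-pattern end of that spectrum, the
sketch below decides whether a file keeps a pristine.  Everything in it
is an assumption for the sake of example: the svn:thiswc:no-pristine
property does not exist, the function names are made up, and the
matching rule (newline-separated patterns tested against the basename,
as svn:ignore patterns are) is just one possible choice.

```python
from fnmatch import fnmatchcase


def parse_patterns(property_value):
    """Split a newline-separated property value into glob patterns."""
    return [line.strip() for line in property_value.splitlines() if line.strip()]


def keeps_pristine(relpath, no_pristine_patterns):
    """Return False if the file's basename matches any no-pristine pattern."""
    basename = relpath.rsplit("/", 1)[-1]
    return not any(fnmatchcase(basename, p) for p in no_pristine_patterns)


# Hypothetical value of svn:thiswc:no-pristine in some working copy.
patterns = parse_patterns("*.zip\n*.pdf.tar.gz.enc\n")
print(keeps_pristine("dist/release/big.zip", patterns))
print(keeps_pristine("src/main.c", patterns))
```

A size- or mime-type-based rule would slot into the same predicate; the
point is only that the decision is per-file state in one working copy.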

A "recommended" viewspec could similarly be packaged and even used by
default.

(svn:thiswc:* rather than svn:wc:* because the latter is/was a thing.)

> > Separately, this sounds like it shouldn't be too hard to prototype:
> > e.g., something along these lines:
> > .
> >    svn checkout --depth=empty -- "$URL" foo
> >    cd foo
> >    svn update --parents --set-depth=empty -- \
> >        $(LC_ALL=C svn info -R -- "$URL" |
> >          grep-dctrl -F 'Node Kind' -ns Path directory |
> >          sort)
> > .
> > where grep-dctrl(1) is a generic "grep a list of rfc822 paragraphs"
> > tool.
> > I realize such prototypes wouldn't automatically deepen the worktree as
> > new directories are added.
> 
> Or just actually script it in real life, building on the simple and
> predictable --depth=directories.
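
For anyone without grep-dctrl(1) handy, the filtering step of that
prototype could be sketched in Python as below.  This is only an
illustration: directory_paths() is a made-up helper, it assumes the
usual blank-line-separated "Key: value" paragraphs that «svn info -R»
prints under LC_ALL=C, and it does not cope with paths that contain
newlines.

```python
def directory_paths(info_output):
    """Return the sorted 'Path' values of paragraphs whose
    'Node Kind' field is 'directory'."""
    paths = []
    for paragraph in info_output.split("\n\n"):
        fields = {}
        for line in paragraph.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value
        if fields.get("Node Kind") == "directory" and "Path" in fields:
            paths.append(fields["Path"])
    return sorted(paths)
```

The resulting list would then be passed to «svn update --parents
--set-depth=empty», as in the shell version.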
> 
> I actually don't mean to defend the --depth=directories proposal *too*
> enthusiastically here.  I do agree that it's worth probing to see if there
> is a better design available.  But I'm conscious of the tradeoff between
> use-case-coverage and comprehensibility. The --depth=directories plan has
> the advantage of being simple to explain and predictable.  More complicated
> behaviors impose a heavier mental load.  For any users who haven't invested
> the effort to understand, they would just be confused by Subversion's
> behavior ("Why are some of this directory's files here but not all of the
> files?")

And under the svn_depth_directories proposal, those users would simply
ask "Why are none of this directory's files here?".

In both cases, the answer is the standard Unix answer: "Because svn was
so ordered by you" [unless a repository admin viewspec is used by
default; so perhaps using such viewspecs should be opt-in rather than
opt-out].

Anyway, it _is_ fair to ask how this state, once reached, would be
indicated.  Presumably, we should do whatever we do to tell the user
about nodes excluded by depth (whether depth=exclude on the node, or
a non-infinity depth on its parent) or by the server ('server-excluded'
in wc.db).  I guess that'll be `svn info` and `svn status`.

> > There's also svn-viewspec.py.
> 
> Ah -- that could be a good tool for prototyping with (I didn't know about
> it).  Also, svn-viewspec.py could be extended to support more decision
> criteria: size, svn:mime-type, arbitrary properties, etc.

There's also a --x-viewspec option in svn.c.

> > > When one is done with the file, one can keep it around or make it
> > > disappear locally. (Right now making it go away requires some fancy
> > > dance moves, but we could fix 'svn update --depth=empty FILENAME' to
> > > Do The Right Thing, or we could add a new flag, or whatever.
> > 
> > There already is «svn update --set-depth=exclude».  An «svn cleanup» is
> > required thereafter to vacuum the unused pristine
> > (https://subversion.apache.org/docs/release-notes/1.7#wc-pristines).
> 
> A command that involves a semi-mandatory 'svn cleanup' afterwards is
> probably not a great experience for users...


Yeah, it's a documented bug.  The link has the bug number.

> But in any case, remember that for the files we're talking about here,
> there is no pristine to be vacuumed anyway.

I was going by what you had just said, that dancing was required to get
rid of the file.

> 
> > > Also, people would presumably write scripts to help with blob
> > > management in SVN, and eventually some of those scripts would make
> > > their way into our contrib/ area.)
> > 
> > contrib/ is deprecated.
> 
> Ah, I forgot that, thanks.  (I remember knowing it at one point, but... my
> brain stays the same size while the quantity of memories grows; there is an
> obvious flaw in this situation.)

I don't see what the "obvious flaw" is.  Given that you remembered
knowing about contrib/'s deprecation at one point, it sounds like the
problem wasn't running out of memory space, but poor indexing.  I think
CVS have a patch for that.

:-),

Daniel

P.S.  Could you please link this thread from #525 if you haven't
already?
