Ben Franksen writes:
 > Am 29.03.2018 um 10:08 schrieb Stephen J. Turnbull:

 > Internally we do use references, similar to git (we refer to patches,
 > inventories, and trees via content hash). But in contrast to git, these
 > are not exposed as a user visible concept. Tags are somewhat special;
 > they do serve to identify versions, i.e. what git uses refs for. But
 > since their behavior is specified in terms of patch dependencies, they
 > are not really an exception to the rule.

I think you're taking the implementation too seriously.  Any user who
understands what a ref is will say "a Darcs tag is too a ref!"

 > > From my point of view, all you've said is "people don't grok DAGs". :-)
 > Ah well, i do keep telling people that the memory in their computers is
 > nothing but an array of bytes, but they still find them complicated, so
 > perhaps they just don't grok arrays?

I don't think they're parallel.  The DAG *is* the highest-level
structure you need to understand what git is doing from the user's
point of view, and if you draw a DAG with arrows and circles, any
four-year-old can trace branches for you (maybe you'll have to use the
dual graph, but I bet most four-year-olds understand "but backwards").
But "array of bytes" isn't very useful in understanding, say, a
compiled C program's control flow.

I don't mean to argue that so many highly-intelligent VCS users need
to go find a four-year-old mentor.  Obviously there's (a) something
("a lot", if you must!) wrong with the git documentation and (b) a
mismatch between people's intuitions about what "VCS is good for" and
what git exposes of its activity to the user.  I don't understand (b),
and until I do, I see no point in trying to write a better git book to
address (a). :-)

 > > I would think "link 'em all" is a better default for most projects,
 > > except that in git branch refs are really lightweight, so developers
 > > are likely to have a bunch of obsolete or experimental branches lying
 > > around that you don't want.
 > Good point. I was thinking about "official" branches only, not
 > experimental/feature/whatever branches that anyone can and does create
 > all the time.

How do you identify "official"?  I think that by now the great
majority of git users use it because their projects are hosted on
GitHub or GitLab, and mostly it's obvious what the "official" branches
are.  But the maintainers of git, who target themselves in designing
new features for it, are very much peer-to-peer sharers, cloning or
pulling branches from a wide assortment of repos.  This is what
"distributed" means to them.  I doubt they'd be willing to make
"export all branches on clone" a default, and it's not clear to me
that the "I just want to see the mainline" aren't the majority.

 > (But then you also don't want the commits on these branches,
 > right?)

I'm not sure what you mean by this.  The way git is currently
designed, you *cannot* commit to tracking branches; they are opaque
behind the "remote" mechanism.  As far as the user is concerned, they
act like tags except when fetching their current state from the remote
repo.  If you want to contribute to such a branch, you do something like

    git branch ben-master origin/master

(git recognizes that origin/master is a tracking branch and
automatically sets up ben-master so that you pull from origin/master
when ben-master is checked out).  You can commit to ben-master, and
(if you have remote permission) push from there to the remote, which
will (if successful) be mirrored in origin/master.
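That round trip can be sketched end to end in a throwaway sandbox (all repo and branch names here are made up, and a local bare directory stands in for the remote):

```shell
# Tracking branch -> local branch -> push -> mirrored in origin/master.
# Assumes git >= 2.28 (for "init -b"); names are hypothetical.
set -e
tmp=$(mktemp -d) && cd "$tmp"
c() { git -c user.email=ben@example.com -c user.name=Ben "$@"; }

git init -q --bare -b master origin.git          # stand-in for the remote
git init -q -b master seed && (cd seed &&
  c commit -q --allow-empty -m initial &&
  git push -q ../origin.git master)

git clone -q origin.git work && cd work
git branch ben-master origin/master   # local branch off the tracking branch
git checkout -q ben-master
c commit -q --allow-empty -m "my change"
git push -q origin ben-master:master  # with permission, updates the remote...
git fetch -q origin                   # ...and origin/master mirrors it
```

After the final fetch, `git rev-parse ben-master` and `git rev-parse origin/master` print the same commit, and `git config branch.ben-master.remote` reports `origin`, which is the automatic setup described above.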

 > Which [of "checkout master only" and "checkout all 'official'
 > branches"] is the better default then depends on the preferred
 > work-flow in your project.


 > > This is how Subversion works (and CVS before it and Bazaar
 > > "lightweight checkouts" after it).  With that restriction, distributed
 > > development is painful.  Avoiding that restriction is why Arch,
 > > BitKeeper, git, Mercurial, Monotone, Bazaar, ... were developed.
 > > Darcs, too. :-)
 > I don't understand. What has distributed versus centralized to do with
 > it? I'd say in a centralized system there is only one "remote", so the
 > question is moot. Is that what you mean?

No, it's not.  By distributed *development* (as opposed to distributed
VCS), I mean a situation where multiple people are updating a single
mainline.  Even with distributed VCS, there's normally an "official"
repo with an "official" master branch in it.  This means that from an
individual developer's point of view, the state of master is a triple:
(1) what's actually in the official repo (unknown; another dev may
    have updated),
(2) what's recorded in your workspace's metadata, and 
(3) what's actually in your working tree.[1]
A centralized VCS doesn't allow you to commit unless (1) == (2).  A
distributed VCS does.  But I think a lot of users' intuitions are
informed (though not fully determined) by this constraint that (1) and
(2) are supposed to be "the same".
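In git terms, the gap between (1) and (2) is easy to demonstrate with two throwaway local repositories (a plain directory stands in for the official repo; names are made up):

```shell
# (1) vs (2): the remote moves on, but your recorded view of it doesn't
# change until you fetch.  Assumes git >= 2.28 for "init -b".
set -e
tmp=$(mktemp -d) && cd "$tmp"
c() { git -c user.email=dev@example.com -c user.name=Dev "$@"; }

git init -q -b master official && (cd official &&
  c commit -q --allow-empty -m one)
git clone -q official local

# Another developer updates the official master: (1) moves on...
(cd official && c commit -q --allow-empty -m two)

# ...but (2), our recorded origin/master, is unchanged until we fetch.
before=$(git -C local rev-parse origin/master)
git -C local fetch -q origin
after=$(git -C local rev-parse origin/master)
```

A distributed VCS lets you keep committing while `before` and `after` differ; a centralized one would refuse.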

 > This is all uncontroversial IMO and has nothing to do with the
 > question we are discussing.

This makes me unsure what you think the questions we discuss are.  I
think that one question is about how in a DAG-based system it's always
possible to identify all past states of all relevant upstream branches
(although you may not be able to recover the names after merges),
whereas in Darcs this is fuzzy (you can only do this for tags).

 > > Darcs avoids all this by modeling a branch as a history-less set of
 > > patches.  Of course the semantics of text require certain implicit
 > > dependencies (you can't delete a line that doesn't exist in the text).
 > (there are systems that do allow that, but not Darcs)

token-replace has an analogous aspect.  But I didn't mean "allows you
to specify deletion of a line that doesn't exist", I mean that it's
useful for some kinds of patches to obey such contextual constraints
because people think of them that way.

 > I am not sure I want "semantic" dependencies. The best a general
 > purpose text based tool can give you is a crude approximation of
 > it.

Of course, a semantic VCS would require a very large amount of
"knowledge" of the language(s) used in the "document".  Such a VCS
could reasonably be called "AI" in the classic sense of passing the
Turing test in a limited domain.

 > The version (DAG) based systems approximate on the "safe" side: any
 > change, even the smallest, semantically irrelevant one, introduces a
 > dependency.
 > Darcs chooses to err on the "flexible" side: by default only the minimum
 > (technically necessary) dependencies are introduced.

I don't think this comparison is entirely accurate.  All DAG-based
systems permit cherry-picking and rebasing, although those like
Mercurial and Bazaar do try to deprecate rebase.  In git they are
first-class operations.
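A minimal sandbox (hypothetical file names) shows both operations as first-class git commands, and that each one creates *new* commits rather than sharing the originals:

```shell
# Cherry-pick and rebase both create new history.  Assumes git >= 2.28.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master r && cd r
c() { git -c user.email=dev@example.com -c user.name=Dev "$@"; }

echo base > f && git add f && c commit -qm base
git checkout -qb topic
echo extra >> f && git add f && c commit -qm "topic work"
orig=$(git rev-parse topic)

git checkout -q master
echo other > g && git add g && c commit -qm "mainline work"
c cherry-pick topic                   # copies "topic work" onto master
picked=$(git rev-parse master)

git checkout -q topic
c rebase -q master                    # replays topic onto the new master tip
```

The cherry-picked commit has a different hash from the original, and the rebase notices the patch is already upstream and drops it, leaving topic and master with identical trees.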

Of course they do insist that new history has been created, and that
is certainly useful from the point of view of "test-driven
development", for example: *none* of the commits on the rebased branch
have been tested in context; any commits that are intended to
represent "tested versions" must be checked out and the tests run.
There are also forensic issues which matter in the proprietary world
(eg, determining priority of a patent application, or independent
development in case of copyright infringement claims).  But (at least
in git) this doesn't prevent rearrangement of code changes to suit the
needs of the developer.

By the way, it's never been clear to me that patch algebra is more
effective than the brute-force "try a cherry-pick or rebase and see if
it works" approach of DAG-based systems.  Leaving aside the
meta-patches (file renaming, creating empty directories) where git is
clearly deficient, does patch algebra allow you to avoid some
conflicts that would occur in a DAG-based system?  If not, what is the
great advantage of patch algebra from your point of view?  Is it
related to the ability to claim the same branch identity for two
workspaces that "haven't diverged too much", where a git rebase in a
published branch all too often results in an unusable mess of

 > An interesting point I hadn't considered yet. But can git give the same
 > name to different URLs? I think it cannot, else how would it know what
 > it should do when I say 'git pull <remote>' (i.e. should it use ssh or
 > http?). So how do "remotes" help to manage the different URLs for the
 > "same" remote repo?

Invariably it is physically the same storage on disc.  Management is a
matter of setting some .git/config variables, like remote.origin.url
for the URL of the "origin" repo, and branch.ben-master.remote for the
remote to use by default when the "ben-master" branch is checked out.
This can be overridden by a command line repo and branch spec, but in
my experience it's always preferable to set up a new remote and use
that instead.  For example, here's my GNU Mailman config (annotated
for this email):

    # boilerplate: this is all default
    [core]
            repositoryformatversion = 0
            filemode = true
            bare = false
            logallrefupdates = true
            ignorecase = true
            precomposeunicode = true
    # automatically set up by "git clone"
    [remote "origin"]
            # An ssh (rsh legacy) "URL".  ssh: and git: URLs are often
            # served by simple servers such that if you can log in at all,
            # you have push access, but GitLab, like GitHub, manages
            # push permission itself.
            url =
            fetch = +refs/heads/*:refs/remotes/origin/*
    # Where I push "pull request" branches as well as my "public"
    # experimental branches.
    [remote "steve"]
            url =
    # A coworker's experimental branch that I helped to debug.
    # The fetch spec means I get only that branch, and that this branch
    # tracks his (there is no corresponding branch for me to commit to;
    # checking this out means a "detached HEAD").
    [remote "mark"]
            url =
            fetch = dmarc:refs/remotes/mark/dmarc
    # The "merge" spec "ties" my master branch to the "official" one, both
    # for pull and for push.
    [branch "master"]
            remote = origin
            merge = refs/heads/master
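For what it's worth, stanzas like the ones above are normally created with commands rather than by editing the file by hand (the URLs below are placeholders):

```shell
# Building remotes like the config above from the command line.
# URLs are hypothetical placeholders.  Assumes git >= 2.28 for "init -b".
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master demo && cd demo

git remote add steve https://example.com/steve/mailman.git
git remote add mark https://example.com/mark/mailman.git
# Narrow mark's fetch spec to a single branch, as in the "mark" stanza:
git config remote.mark.fetch dmarc:refs/remotes/mark/dmarc
```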

 > > You don't have to like it, but there are strong reasons for doing it
 > > this way if you want your development organization to scale to many
 > > developers working independently on anything they want to.
 > I must say that I lack the experience of working at a scale of something
 > like the Linux kernel. But I do value the possibility to push and pull
 > patches from any repo to any other as darcs allows me to

git allows pulling and pushing branches to any other repository,
assuming you have appropriate permissions on both repositories.  Why
would you think it different?  Of course we don't all have permission
to push upstream, and a good thing too: if *we* could, so could
"Fancy Bear"!

 > and I am using that feature in practice. I am pretty sure the Darcs
 > model would scale to a large number of developers but I have no
 > proof.

I'm not sure what you mean by "model" and what you mean by "scale".
The problem with repo per branch per version is just multiplication,
when you realize that every repo of every developer is a different
version.  Storage blows up, the naming conflicts will be frequent
unless you're willing to endure network outages and delays, and URLs
for personal repos are often long and/or unintuitive.

 > > If there are multiple people with push permission, your *VCS*
 > > will need a conceptual way of referring to content that is
 > > intended to end up in the "official" branch that diverges from
 > > other content also destined for that branch (or already
 > > incorporated in that branch).
 > Of course. But then, assuming I do not want to push changes to "master"
 > because this is how the project is organized, then I just don't do it,
 > right?

Long experience shows that is easier said than avoided. ;-)  This is
why git has an "url" variable for each remote used for fetch and push
by default, but also provides "pushUrl" in case you want to
differentiate the destination by operation.

 > >  > I should rather have created my own branch and committed there, so
 > >  > the remote owner of the branch can integrate my changes with a
 > >  > merge?
 > > 
 > > I'm not saying "you should", I'm saying "you do".  In a DVCS, by
 > > committing locally you *do* create your own branch.
 > Yes, exactly. I took all this for granted, which is why I asked "so
 > what"?

Rebasing, which doesn't screw up history in Darcs the way it does in a
DAG-based system because Darcs deliberately doesn't keep history.
Despite what you say you take for granted, this allows Darcs users to
identify the remote branch with their own.  Darcs conforming to users'
expectations in this way may be a good thing in practice, but I think
this expectation is one of the main causes of inability to grasp git.
(That may or may not be a problem, depending on which projects you
want to contribute to, of course!)

 > With branches, things may be different. It may make sense to have push
 > behave in a "safe" way by default, that is, create a different remote
 > branch.

This practice was a huge annoyance with the default set up of
Mercurial (as of the time XEmacs converted to Mercurial).  I would
never recommend it: most such pushes would just get lost.  In git,
though, you always push to a specified branch (which may be implicit
in a .git/config variable).

 > I have become wary of transplanting /any/ kind of expectations from
 > Darcs to git.

<snort/>  That does make sense!

 > > Only because you don't have multiple branches in one repo, so URL of
 > > repo == name of branch == only ref that ever matters to you, and it's
 > > mostly trivial to keep track of "here vs there".
 > Yes, there is some truth to that. I would very much like to retain this
 > conceptual simplicity even when/if we add branches to Darcs. I think
 > that if we use the current model as a guide, then we can achieve that:
 > A URL+branch in Darcs-with-branches behaves like a URL does now. A
 > branch alone is short for "the local repo"+branch. A URL alone means
 > URL+"default branch", where "default branch" is the name of the branch
 > you are on, unless configured otherwise. Everything else remains as it is.

This is what happens in git now, except that you are able to set your
own defaults in .git/config, and provide aliases for URLs (the
remotes).  You can argue that remotes provide more confusion than
convenience if you like, but several years of experience have shown
that for the vast majority of git users it's the other way around.
Whatever confusion is experienced due to remotes, the convenience
gained is much greater.

This is not true for branches.  "Colocated branches" (ie, the many
branches per repo model) do seem to cause confusion.  My guess is that
a Darcs-with-branches would have the same problem.

 > Another point where it is problematic to transfer concepts naively. Yes,
 > in a way this is what obliterate would do, more specifically 'darcs
 > obliterate --not-in-remote'. This is not something I would associate
 > with "rollback" in the transactional sense, even though i admit that
 > technically it is. (My view of transactions is that they are short-lived
 > deviations from the one-state-for-all norm.)

In context, "short-lived deviation" is exactly the sense I meant: in
case of a merge with way too many conflicts, you want to "rollback" to
the pre-merge state.

 > >  > Neither makes the rest of the statement any sense to me: what you
 > >  > committed and how to get back to where you started could be
 > >  > calculated by comparing the local with the remote DAG, right? So
 > >  > what's the problem?
 > > 
 > > You don't know when the ref has moved in the remote DAG (git doesn't
 > > record timestamps for push, and both author and committer commit
 > > timestamps can be forged at commit time, which is different from push
 > > time), so that's not useful.
 > Hm. When I transfer this back to Darcs, the remote repo has accumulated
 > more patches. As long as I do not pull them, my reference point is
 > unchanged, so 'darcs obliterate --not-in-remote' really brings me back
 > to where I started.

I think that's the right analogy.  So it is a rollback.

 > I see. In my view of things this is yet another point where the Darcs
 > model gives you an advantage without any additional effort.

Because patch algebra allows you to compute that there *will* be a
conflict without applying any patches?  But I recall lots of cases
where I'd get a nasty conflict in the workspace that I wanted to back
out of in Darcs.  obliterate would require a checkout in those cases
to restore sanity in the working tree (I believe that was implicit).
That's no different from git.

 > But then why do all these commands have literally hundreds of
 > options and the man pages "explaining" them are stuffed with
 > technical details that no one except a handful of experts
 > understand?

Humans gonna human. ;-)

 > Try to google "how to xyz with git" where xyz is something simple
 > occurring in your daily work. You'll find all sort of answer on
 > stack-overflow, none of which is a single command with no options.

You know what?  I have never done that.  Not once.  I've written the
answers, though (see Python's PEP 374, IIRC).  Nor do I use the myriad
of options (except to filter-branch, but that isn't a "command", it's
a "Turing-complete shell"! :-)

 > I have no problem with a tool that is powerful and allows me to play all
 > kinds of tricks, as long as these tricks don't violate the internal
 > invariants that hold everything together.

git is internally invariant, with the single exception of git-prune,
and git-fsck and git-gc which call it under certain circumstances.
The problem that people run into is that it's not always obvious what
commands will do to the "external handles" that we call "refs".

As long as you're committed up, you've always been able to do anything
with git that git can do and be safe, with the exceptions of "git gc
--prune=now" and "git prune", because those delete unreferenced
objects that you might want to restore references for.  And since the
advent of the reflog, you can do those, too.  Back in the bad old
days, all you needed to do was tag the branch to preserve before doing
something dangerous (usually, rebase).  Now with the reflog you don't
even need to do that.
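A quick sandbox showing the reflog acting as that safety net (commit messages are made up):

```shell
# Recovering a "lost" commit from the reflog.  Assumes git >= 2.28.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master r && cd r
c() { git -c user.email=dev@example.com -c user.name=Dev "$@"; }

c commit -q --allow-empty -m one
c commit -q --allow-empty -m two
before=$(git rev-parse master)

git reset -q --hard HEAD^         # "dangerous": the ref abandons commit "two"
git reset -q --hard 'master@{1}'  # the reflog remembers where master was
after=$(git rev-parse master)
```

No preparatory tag was needed; `master@{1}` is the reflog's record of where the ref pointed before the reset.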

 > In git, when you rebase, it is the user's responsibility to ensure
 > consistency, and humans are notoriously bad at things like that.

Sigh.  This simply isn't true.  *The DAG is immutable.*  Yes, in the
Dark Ages there was no reflog so you could seriously confuse yourself
and others by rebasing.  Now it's pretty trivial to clean up "you
moved the ref behind my back" messes.

 > Ugh. Three branches.

There are *always* at least three branches: remote repo, local repo,
working tree.  This is just as true of Darcs.  Darcs just makes it
easier to maintain the fiction that they're all near-identical
versions of "the same thing", by (1) abandoning consistency of
temporal history with the underlying object database, and (2)
requiring that persistently different branches be given their own repositories.

(1) and (2) are a very good tradeoff for many users.  My guess is that
they would turn out to be a bad bet if you tried to manage the Linux
kernel or GCC development with Darcs.  (Let me say here that I think
the GHC switch was an unfortunate historical accident, not evidence
for this guess.  I think GHC is big enough to test scalability of the
Darcs model, and was then, but that's not why they switched AIUI.)

 > It's just that in Haskell there is (currently) no way to express that
 > proof /inside the language/ so it can be automatically checked by the
 > compiler.

Well, there's also the fact that there's a perfectly good term for an
endofunctor that may or may not be a monad, and needs to be proved to
be a monad before you treat it as one: endofunctor.  But monad sounds
cooler so they appropriated it.

 > We will see if we can come up with something better. I like it if my
 > tool does the obvious thing right out of the box without any need for me
 > to configure it.

As far as I can tell, modern git does that, except that you and I
disagree on whether it's obvious that all of the remote's branches
should be linked to local branches or only the remote's default, and:

 > >  > I also detest that I have to register remote repos locally in
 > >  > order to refer to them in commands, giving them some arbitrary
 > >  > local name, when they already have a perfectly good
 > >  > universally valid name (the URL).

I disagree that the remote's URL is a *perfectly* good name, because
the version it refers to is *unstable*.  In git, you know that "git
diff origin/master" will give the same result every time, until you
fetch that branch.  In repo-per-branch models, you don't know that,
because somebody may commit in the other repo.

 > """
 > When a local branch is started off a remote-tracking branch, Git sets
 > up the branch (specifically the branch.<name>.remote and
 > branch.<name>.merge configuration entries) so that git pull will
 > appropriately merge from the remote-tracking branch. This behavior may
 > be changed via the global branch.autoSetupMerge configuration flag. That
 > setting can be overridden by using the --track and --no-track options,
 > and changed later using git branch --set-upstream-to.
 > """
 > I am getting headaches from this. I think it means (but I am far from
 > sure) that to get the behavior I want, I should checkout a remote
 > tracking branch and then start a local branch from that?

In fact what it means is that you normally don't need --track because
that's the default, and you don't need to checkout the tracking branch
(you only do that if you want to look at the corresponding working
tree).  You just need to define the remote:

    git remote add ben <url-of-bens-repo>

and then the local branch:

    git branch somebranch ben/somebranch

 > > git users usually have a bunch of obsolete or experimental
 > > branches lying about, that you would not be interested in
 > > tracking.
 > Granted. So how does git know which branches you are interested in and
 > which not? Simple: you are (supposed to be) interested in whatever the
 > remote has named "master". No?

No, that default is only for a clone, and it's whatever is checked out
in the source repo, which is usually "master" for a public repo.  But
on a clone by default you get the whole DAG, so it's easy to make
local branches from the tracking "branches":

    git branch another-branch origin/another-branch

For different remote, you need to configure the remote, which by
default will create tracking branches in the local repo for all the
remote branches.  To create a local version that starts synchronized
to a particular tracking branch, you must explicitly specify the
tracking branch as above.

 > Once again, I prefer to (have to) tell it explicitly which branch I
 > want.

And also specify which tag or "head" explicitly?  Or is that "silly"
because you almost always want "head"? :-)  AFAIK, most git users are
happy with the default because it's very frequently right, and when
it's not, it's a special enough case that you don't forget.

 > For personal repo where a developer uploads all the stuff he/she is
 > working on, you could clone the one branch you are interested
 > in. (I don't know if you can clone a branch in git.)

You can.  Like "shallow clones", in modern git there's really only a
point if you are space- or bandwidth-constrained.
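To make that concrete, a sketch of a single-branch clone (a local path stands in for the remote URL):

```shell
# Cloning just one branch.  Assumes git >= 2.28 for "init -b".
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master src && (cd src &&
  git -c user.email=dev@example.com -c user.name=Dev \
      commit -q --allow-empty -m base &&
  git branch other)

git clone -q --single-branch --branch master src one-branch
# origin/master exists in the clone; origin/other was never fetched.
```

Add `--depth 1` for a shallow clone if bandwidth, not just ref clutter, is the concern.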

 > BTW, don't people clean up their repos every now and then
 > i.e. throw away obsolete branches?

To some extent, yes, but in modern practice where people use a staging
repo on GitHub where they set up a branch per pull request, active
developers often have many (say up to a dozen) pull requests open at
any given time.  Of course if you're a committer you probably can keep
that under a half-dozen.  Local repos in this model tend to get thrown
away as a unit.

 > What about the sharing with colleagues? (Of configuration changes or new
 > features or fixes that aren't ready for upstream.) As I understand, in
 > your work-flow these are all either local branches in a repo in your
 > home dir, perhaps on your own computer. Or else, you push them for all
 > the world to see in a branch to the upstream repo. Both of which aren't
 > ideal IMO. You really want a third repo in between upstream and local
 > for that.

Yes, as I describe above these days it's typically on GitHub.

 > In git this must be a bare repo, so you cannot and aren't supposed
 > to work in it, right?

I guess; in practice on GitHub you can't work in it.  I suppose
setting it up as a bare repo does help prevent "wrong cwd" boo-boos.

 > > Well, "origin" *is* an URI, relative to the local repository, if
 > > you're in one. 
 > This is a contradiction in terms. The 'U' in URI stands for
 > 'universal'.

URIs come in relative and absolute forms.  "origin" is a relative one.

 > I did that a few days ago, because setting up the remotes correctly is
 > just too much hassle for me.

If you say so.  "git remote add <alias> <url>" is all I've ever needed.

 > Whenever you call something "immediately plausible" in git, it
 > feels to me like we live on different planets. For instance, here
 > you refer to "a commit's *content* object" and I have only a vague
 > idea what that is.

The content object in a traditional git repository is just a tree
object (representing the top directory in the working tree).  In a
submodule, that can be a commit which comes from a different
repository (that of the submodule's project).

 > Neither do I understand "recursive DAG" or why you put it in quotes.
 > Your first paragraph could as well be in Chinese.

By taking the set of ancestors of this commit along with their parent
relations, you get a DAG.  The DAG for a submodule is represented by
that commit in the main project's DAG, thus "recursive".  It's in
quotes because I have no idea whether that terminology corresponds to
a useful mathematical idea. :-)

 > You said earlier that git represents a submodule as a tree object
 > that is itself a commit. But it cannot be the commit that
 > represents the current (pristine) tree in the submodule, else I
 > could not make a commit in the submodule (or pull there) without
 > makeing a commit in the containing repo/branch.

I'm not sure what you mean by this.

 > So the best it can be is the nominal version of the submodule, as
 > specified in the .gitmodules file, right?

Not quite.  First, the .gitmodules file does *not* specify a version,
except implicitly for the first checkout.  After that, it will be
specified by the commit representing that submodule in the tree object
representing the parent directory of the submodule.  Consider an
example, below.

We have app, the toplevel directory for our project, app/src, a
directory containing the code for our main program, and app/lib, a
submodule (directory) containing a library developed by another
project.  The contents of the file implementing a commit will look
something like this (the comments following # are not part of the

    parent: <SHA1>                    # refers to a commit object
    date: Tue Apr 10 01:58:00 2018
    tree: <SHA1>                      # refers to a tree object

The tree object will look something like:

    README: <SHA1>                    # refers to a blob object
    Makefile: <SHA1>                  # refers to a blob object
    src: <SHA1>                       # refers to a tree object
    lib: <SHA1>                       # refers to a commit object

Now, the SHA1 for lib is initialized to the HEAD of the repo named in
.gitmodules.  After that, if the repo in lib is changed, either by
pulling new commits from lib's upstream or by a local commit, nothing
happens immediately, but you can use "git submodule update
<submodule>" to update that tree object (in the index).  Normally,
this just checks out the commit recorded in the tree object displayed
above, but it can also be configured to merge any local commit or
rebase the local commits in the submodule on the commit in the tree
object.  Then if you commit the main project, the tree object
representing the project in the object database will be updated to
reflect the HEAD commit in the submodule.
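That dance can be walked through locally (hypothetical names; recent git requires `protocol.file.allow` to add a submodule from a local path):

```shell
# A submodule is a *commit* entry (gitlink) in the superproject's tree.
# Assumes git >= 2.28 ("init -b"); protocol.file.allow is needed on
# git >= 2.38 to add a submodule from a local path.
set -e
tmp=$(mktemp -d) && cd "$tmp"
c() { git -c user.email=dev@example.com -c user.name=Dev \
          -c protocol.file.allow=always "$@"; }

git init -q -b master lib && (cd lib && c commit -q --allow-empty -m lib1)
git init -q -b master app && cd app
c commit -q --allow-empty -m app1
c submodule --quiet add ../lib lib     # records lib's HEAD as a gitlink
c commit -qm "add lib submodule"

# Advance the library, then record its new HEAD in the superproject:
(cd ../lib && c commit -q --allow-empty -m lib2)
c submodule --quiet update --remote lib
git add lib && c commit -qm "bump lib"
```

`git ls-tree HEAD lib` shows the entry with mode 160000 and type "commit", and after the final commit its SHA1 matches lib's master tip.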

 > It has nothing to do with preferences. The idea of the UUID is that it
 > remains invariant under mutation, so a hash just doesn't cut it, you
 > need some non-deterministic seed (the things you listed aren't enough to
 > ensure that, BTW, as they can be faked).

Oh, I see.  This is what the Bazaar folks call a "container ID", which
they use to track a file across renames and the like, even if there
are content changes at the same time.

 > In a Unix file system, the inode represents file identity. It does not
 > change when the file is mutated. This must be different in git, then,
 > since a hash can only refer to a specific version of the file. Does each
 > blob object contain a reference to its previous version(s), or is
 > tracking identity of files done only at the commit level?

Tracking identity of files in the sense of UUID as you describe is not
done at all.  What happens is that tree objects associate file names
with "blobs" of content.  If this pair changes (a name disappears, a
new name appears, or the blob associated with a name changes) git will
check to see if the relevant blob, or one whose diff is only a "small"
fraction of the filesize, appears elsewhere in the project.  If so,
git interprets that as a rename or a copy of the file.  But it's not
hard to imagine cases where files with independent origin get the same
content (eg, empty files such as are used to ensure that directories
are recorded in git).
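A tiny demonstration of that inference (made-up file names): the rename is never stored, only detected after the fact:

```shell
# Rename detection by content similarity.  Assumes git >= 2.28.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master r && cd r
c() { git -c user.email=dev@example.com -c user.name=Dev "$@"; }

printf 'line1\nline2\nline3\n' > old.txt
git add old.txt && c commit -qm add
git mv old.txt new.txt
c commit -qm rename

# Nothing recorded a rename; -M asks diff to infer one from content:
out=$(git diff --name-status -M HEAD^ HEAD)
```

Here `out` is an `R100` line (100% similarity) naming old.txt and new.txt; had the content also changed substantially, git would report an unrelated delete and add instead.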

 > But there are now also ["ghost"] objects that are not manifest,

Do they have content, or are they empty, waiting to be filled and
plopped down somewhere on the file system?

[1]  git adds a fourth component, the content of the index, but even I
only use that as a way to cache adds so I don't have to specify all
interesting files on the command line in case of a partial commit.
Please ignore that possibility here.

[2]  There's also a culture of "commit early and often" (and edit your
commits), which keeps the working tree "close" to the local repo.  The
Mercurial and Bazaar communities have a "commit only complete,
coherent changesets" bias, and when it gets extreme the working tree
can get scarily divergent from the local repo.

darcs-users mailing list
