Re: git workflow for D

H. S. Teoh via Digitalmars-d-learn Tue, 05 Dec 2017 10:46:50 -0800

On Mon, Dec 04, 2017 at 06:51:42AM -0500, Nick Sabalausky (Abscissa) via 
Digitalmars-d-learn wrote:
> On 12/03/2017 03:05 PM, bitwise wrote:
> > I've finally started learning git, due to our team expanding beyond
> > one person - awesome, right?
> 
> PROTIP: Version control systems (no matter whether you use git,
> subversion, or whatever), are VERY helpful on single-person projects,
> too! Highly recommended! (Or even any time you have a directory tree
> where you might want to enable undo/redo/magic-time-machine on!)


+100!  (and by '!' I mean 'factorial'. :-P)

I've been using version control for all my personal projects, and I
cannot tell you how many times it has saved me from my own stupidity
(e.g., having to roll back a whole bunch of changes, or just plain ole
consulting an older version of the code that I've forgotten). Esp. with
git, it also lets me play with experimental code changes without ever
worrying that if things don't work out I might have to revert everything
by hand (not fun! and very error-prone).

In fact, I use version control for more than just code: *anything*
that's text-based is highly recommended to be put under version control
if you're doing any serious amount of editing with it, because it's just
such a life-saver. Of course, git works with binaries too, but diffing
and such become a lot easier if everything is text-based.  This is why I
always prefer text-based file formats when it comes to authoring.

Websites are a good example that really ought to be under version
control.  Git, especially, lets you clone the website to a testing
server where you can experiment with changes without fear, and once
you're happy with the changes, commit and push to the "real" web server.
Notice an embarrassing mistake that isn't easy to fix? No problem, just
`git checkout HEAD^`, and that buys you the time you need to fix the
problem locally, then re-push.

I've also recently started putting certain subdirectories under /etc in
git.  Another life-saver when you screw up a configuration accidentally
and need to revert to the last-known good config. Also good for
troubleshooting to see exactly what changes were made that led to the
current state of things.

tl;dr: use version control WHEREVER you can, even for personal 1-man
projects, not only for code, but for *everything* that involves a lot of
changes over time.


> > Anyways, I've got things more or less figured out, which is nice,
> > because being clueless about git is a big blocker for me trying to
> > do any real work on dmd/phobos/druntime. As far as working on a
> > single master branch goes, I can commit, rebase, merge, squash,
> > push, reset, etc, like the best of em.
> 
> Congrats! Like Arun mentioned, git's CLI can be a royal mess. I've
> heard it be compared to driving a car by crawling under the hood and
> pulling on wires - and I agree.
> 
> But it's VERY helpful stuff to know, and the closer you get to
> understanding it inside and out, the better off you are. (And I admit,
> I still have a long ways to go myself.)

Here's the thing: in order to use git effectively, you have to forget
all the traditional notions of version control. Yes, git does use much
of the common VC terminology, and, on the surface, does work in similar
ways.

BUT.

You will never be able to avoid problems and unexpected behaviours
unless you forget all the traditional VC notions, and begin to think in
terms of GRAPHS. Because that's what git is: a system for managing a
graph. To be precise, a directed acyclic graph (DAG).

Roughly speaking, a git repo is just a graph (a DAG) of commits,
objects, and refs.  Objects are the stuff you're tracking, like files
and directories.  Commits are snapshots: each one records the full set
of files (objects) as of that point, plus pointers to its parent
commit(s). Refs are just pointers to certain nodes in the graph.

A git 'branch' is nothing but a pointer to some node in the DAG. In git,
a 'branch' in the traditional sense is not a first-class entity; what
git calls a "branch" is nothing but a node pointer. The traditional
"branch" is merely a particular configuration of nodes in the DAG that
has no special significance to git.

Git maintains a notion of the 'current branch', i.e., which pointer will
serve as the location where new nodes will be added to the DAG. By
default, this is the 'master' branch (i.e., a pointer named 'master'
pointing to some node in the DAG).

When you run `git commit`, what you're doing is creating a new node in
the DAG, with the parent pointer set to the current branch pointer. So
if the current branch is 'master', and it's pointing to the node with
SHA hash 012345, then `git commit` will create a new node with its
parent pointer set to 012345.  After this node is added to the graph,
the current pointer, 'master', is updated to point to the new node.

By performing a series of `git commit`s, what you end up with is a
linear chain of nodes, with the current branch ('master') pointing to
the last node.  This, we traditionally view as a "branch", but in git,
there is nothing special at all about this chain; it's just a (sub)graph
of some nodes. The git 'branch' is nothing but a pointer to the last of
these nodes. You can easily make this pointer point to something else --
you wouldn't normally do this, but sometimes it can be useful.

You can also decide that instead of adding new nodes to 'master', you
want to add new nodes elsewhere in the DAG. No problem, just `git
checkout` some arbitrary node, and start running `git commit` on it. The
first new commit will take that node as parent, and thereby start
creating a new chain of nodes "branching off" the 'master' chain. (Git
calls this state a 'detached HEAD', since no branch pointer follows the
new nodes; run `git checkout -b <name>` to create one, so the new chain
stays reachable.)
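
To make the last few paragraphs concrete, here's a toy model in Python.
It's only a sketch of the idea, not how git is actually implemented:
the names `Commit`, `refs`, `current`, `commit()` and `history()` are
all made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Commit:
    message: str
    parents: tuple = ()   # parent Commit objects; empty for the root

refs = {}                 # branch name -> Commit; a 'branch' is just this entry
current = "master"        # name of the current branch

def commit(message):
    """Create a node whose parent is the current tip, then move the pointer."""
    tip = refs.get(current)
    node = Commit(message, (tip,) if tip is not None else ())
    refs[current] = node
    return node

def history(node):
    """Follow first-parent links from a node back to the root."""
    out = []
    while node is not None:
        out.append(node.message)
        node = node.parents[0] if node.parents else None
    return out

a = commit("first")
b = commit("second")
c = commit("third")
print(history(refs["master"]))       # ['third', 'second', 'first']

# Point a second name at an older node and commit there: the graph forks.
refs["experiment"] = a
current = "experiment"
d = commit("try something")
print(history(refs["experiment"]))   # ['try something', 'first']
print(history(refs["master"]))       # ['third', 'second', 'first']
```

Note that nothing in the model knows what a "branch" in the
traditional sense is; there are only nodes, parent links, and two named
pointers.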

Merging a branch in git is likewise not something you'd think of in
traditional VC terms; it's basically nothing but creating a new node
with two parents, one from the tip of each respective branch. You can
'merge' any two arbitrary nodes together. Though of course, in general
you'll end up with a huge number of conflicts if the node contents
aren't correlated with each other -- but git doesn't actually mind that;
you can actually overwrite all the contents with something else
altogether and commit that, and git will happily take that as the
"merge" of the two unrelated branches. The resulting graph won't make
any sense in terms of revision history in the traditional VC sense, but
git doesn't care. The point is that as far as git is concerned, it's all
just a DAG.  The fact that the contents of two adjacent nodes happen to
be similar is just a "coincidence", albeit a usual one.
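
In the same toy spirit (again a sketch with made-up names, not git's
real machinery), a merge is just a node with two parents, and the
content of that node is whatever you choose to commit:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Commit:
    content: str
    parents: tuple = ()

def merge(ours, theirs, resolved_content):
    """A 'merge' is a node with two parents; the content is up to you."""
    return Commit(resolved_content, (ours, theirs))

base = Commit("hello")
left = Commit("hello world", (base,))
right = Commit("hello there", (base,))

sensible = merge(left, right, "hello there world")
# Nothing stops the merged content from being unrelated to both parents.
nonsense = merge(left, right, "something else entirely")

print(len(sensible.parents), len(nonsense.parents))   # 2 2
```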

The more 'arcane' git operations like rebasing, history rewriting, etc.,
are at the end of the day nothing more than graph operations, updating a
bunch of pointers and moving nodes around.  If you begin thinking of
your repo as a graph and forget traditional VC notions of branches,
you'll find that git suddenly starts to "make sense", and you'll be
able to do amazing things to your repo without losing your way.


[...]
> ([...] there's nothing worse than accidentally losing a bunch of
> important code, or finding you need to undo a bunch of changes that
> didn't work out.)

If you think in terms of graphs, you'll hardly ever need to worry about
losing changes.  Just think in terms of code: if you were given a bunch
of pointers to nodes in a graph, and you need to update these pointers,
what's the safest way to do it?  Easy: just save the pointers to some
local variables, then do whatever updates you want, and if it doesn't
work out, just overwrite the pointers with the saved values, and you're
back to where you started.

In git, because everything is SHA-hashed, nodes are actually immutable.
Even the so-called history rewriting, technically speaking, isn't really
"rewriting"; it's actually creating a NEW subgraph that just happens to
be similar to the older part of the graph plus some changes, and
updating your refs (pointers) to point to nodes in the new part of the
graph instead.  In git, nodes that have nothing pointing to them are
considered garbage; `git gc` will delete them from the graph.  So once
all your pointers are pointing to the new nodes, you've effectively
discarded the old nodes; hence the overall effect is "rewriting" the
graph.  But if you still keep a ref to the old nodes, they will still be
there; nothing is lost.
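
Here's a rough sketch of that garbage rule in Python, assuming a repo
is modelled as a dict of node ids to parent ids plus a dict of refs
(the function `reachable()` and the node names are hypothetical, and
real `git gc` has more rules than this):

```python
def reachable(refs, parents):
    """Return every node id reachable from any ref via parent links."""
    seen = set()
    stack = list(refs.values())
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen

# node id -> parent ids.  'c_old' has been "rewritten" into 'c_new'.
parents = {
    "a": [],
    "b": ["a"],
    "c_old": ["b"],
    "c_new": ["b"],
}

refs = {"master": "c_new"}
garbage = set(parents) - reachable(refs, parents)
print(sorted(garbage))               # ['c_old']

# Keep a spare pointer to the old node, and nothing is garbage any more.
refs["backup"] = "c_old"
garbage_with_backup = set(parents) - reachable(refs, parents)
print(sorted(garbage_with_backup))   # []
```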

It's like dealing with immutable values in D: you can never change them,
but you *can* make (modified) copies of them and change your pointers
to point to the copies instead of the original values.  As long as you
still keep refs to the old nodes, they will never be lost no matter what
you do to your graph.  And note that the parent pointers in each node
are also part of the SHA hash, so the topology of the old part of the
graph is immutable too.  There is literally nothing you can do that can
change the content or topology of those old nodes. As long as you have a
way to reach them, you will still have your old history completely
intact.
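
A small sketch of why the topology is frozen along with the content.
The payload layout below is a simplification I made up for the example
(git's real object format is different), but the principle is the same:
the parent ids go into the hash, so the "same" change on a different
parent is a different node.

```python
import hashlib

def node_id(content, parent_ids):
    """Hash the content together with the parent ids (simplified format)."""
    payload = "\n".join(["parents:" + ",".join(parent_ids), content])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

root = node_id("initial import", [])
other_root = node_id("a different start", [])

child = node_id("fix typo", [root])
same_change_elsewhere = node_id("fix typo", [other_root])

print(child == node_id("fix typo", [root]))   # True
print(child == same_change_elsewhere)         # False
```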

And how do you create backup copies of your pointers? Easy: remember a
git 'branch' is nothing but a pointer? Well, so you just go `git
branch backup_ref <branch>` and now you have a pointer called
'backup_ref' that points to that same node that <branch> is pointing
to, without changing which branch is current.  Now you can do whatever
you want to <branch> -- add
new commits, overwrite it with a ref to a completely different node,
whatever.  If at any point you decide that you want it to point to the
original node again, just `git checkout <branch>; git reset --hard
backup_ref`.  As long as you don't touch backup_ref, you will be able to
go back to the original state.

(See? This is why you have to stop thinking of a git repo in traditional
VC terms.  Your git repo is a graph. (With immutable nodes.) That's all
there is to it.)


> One thing to keep in mind: Any time you're talking about moving
> anything from one repo to another, there's exactly two basic
> primitives there: push and pull. Both of them are basically the same
> simple thing: All they're about is copying the latest new commits (or
> tags) from WW branch on XX repo, to YY branch on ZZ repo. All other
> git commands that move anything between repos start out with this
> basic "push" or "pull" primitive. (Engh, technically "fetch" is even
> more of a primitive than those, but I find it more helpful to think in
> terms of "push/pull" for the most typical daily tasks.)

Again, this will all make so much more sense if you think in terms of
graphs.

What `git fetch` does is to download a bunch of nodes from a remote
source.  Don't even think in terms of branches; think in terms of
individual nodes (which imply their own graph connectivity structure --
because the parent pointers are an immutable part of them) that are
downloaded from the remote source.  After downloading these nodes, git
will create a new pointer (i.e., ref) to point to the last node (i.e.,
the node from which the other nodes can be reached), usually with a name
like upstream/somebranch.  There is nothing special about this name
besides the convention that we use names of the form x/y for pointers
named 'y' that we downloaded from 'x'; it's just a pointer to some
nodes that you downloaded off the 'net.
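
As a sketch (toy Python again; `fetch()` here is a made-up function
operating on plain dicts, not git's actual transfer protocol), fetching
amounts to copying over the nodes you don't have yet and recording the
remote's tips under x/y names, without touching your own pointers:

```python
def fetch(nodes, refs, remote_name, remote_nodes, remote_refs):
    """Copy missing nodes and record remote tips as <remote>/<branch>."""
    for node, parent_ids in remote_nodes.items():
        # Nodes are immutable, so an id we already have is never overwritten.
        nodes.setdefault(node, parent_ids)
    for branch, tip in remote_refs.items():
        refs[remote_name + "/" + branch] = tip

# Local repo: node id -> parent ids, plus our own branch pointers.
nodes = {"a": [], "b": ["a"]}
refs = {"master": "b"}

# The remote has moved on by two nodes.
remote_nodes = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
remote_refs = {"master": "d"}

fetch(nodes, refs, "upstream", remote_nodes, remote_refs)
print(sorted(nodes))   # ['a', 'b', 'c', 'd']
print(refs)            # {'master': 'b', 'upstream/master': 'd'}
```

The local 'master' still points at 'b'; only the new pointer
'upstream/master' knows about the downloaded nodes.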

What 'git pull' does is to try to reconcile these downloaded nodes with
the nodes in your local branch -- and here is where wrinkles can arise,
because, by convention, git will try to merge the nodes from x/y into
the local branch called y.  It's all good if the local branch y points
to an ancestor of x/y, i.e., your local branch is just a subgraph of the
remote branch, and since the parent pointers of the downloaded nodes
already point to y (i.e., they are already a part of the graph! --
because they share an ancestor node), the only thing that's needed is to
update y to point to x/y (i.e., the new tip of the branch) instead.
This is called 'fast-forwarding'.

But what if your local branch has diverged from the remote branch? I.e.,
the nodes in local branch 'y' share a common ancestor with the
downloaded nodes in x/y, but have different descendant nodes. Now we
cannot simply set y to x/y, because that would cause you to lose your
pointer to your local nodes, which means `git gc` will garbage-collect
them (i.e., your local changes will be lost).  So git tries to be
'helpful' here by attempting to merge the nodes together -- i.e., create
a new node that incorporates the changes from *both* y and x/y.
Unfortunately, this process often causes further problems, because
remember, nodes are immutable, so the only way you can merge the
changesets together is by creating a new node (a "merge commit" in git
parlance) that has both tips as parents.  But once you do that, your
local branch 'y' is no longer the same as the remote one, so when it
comes time to push your changes to other collaborators, or to pull from
remote again later, you pile up more merge commits (and often more
conflicts) in a never-ending spiral.

The best approach is to avoid this situation altogether, by designating
certain branches (usually master) as pull-only, i.e., you never commit
changes to them, all your changes are committed to local branches. In
terms of graphs, you never change the value of the 'master' pointer, but
may add new nodes to the graph by using other pointers ("local
branches") for that purpose.  Then `git pull` will always be
fast-forward only (the value of the local 'master' pointer will always
be equal to, or an ancestor of, the remote 'master' pointer, so it is
always possible to just replace the local 'master' pointer with the
remote value without losing any nodes).  This is why I recommend to
*always* run:

        git pull --ff-only upstream master

The --ff-only tells git not to try to be smart and create a mess of
merge commits, but to only ever fast-forward the master pointer.  If
this fails, then you know you've made a mistake and updated the master
pointer where you should have used a local branch instead.  (How to fix
this is left as an exercise for the reader: hint, remember 'master' is
just a pointer. Just create a new local branch to point to the current
nodes, i.e., back up your pointer, then reset 'master' to the last common
ancestor with the upstream nodes, then `git pull`, and rebase your local
branch afterwards.)
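
The fast-forward-only rule is easy to state in graph terms, so here's a
toy sketch of it.  The functions `is_ancestor()` and `pull_ff_only()`
and the dict-based repo are my own illustration, not git internals: the
local pointer may move to the remote tip only if its current value is
an ancestor of that tip, since otherwise some local nodes would lose
their last pointer.

```python
def is_ancestor(parents, maybe_ancestor, node):
    """True if maybe_ancestor is node itself or reachable via parent links."""
    stack, seen = [node], set()
    while stack:
        cur = stack.pop()
        if cur == maybe_ancestor:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents[cur])
    return False

def pull_ff_only(parents, refs, local, remote):
    """Move the local pointer to the remote tip, or refuse if diverged."""
    if not is_ancestor(parents, refs[local], refs[remote]):
        raise RuntimeError("not a fast-forward: %s has diverged" % local)
    refs[local] = refs[remote]

# a <- b <- c <- d is upstream's chain; x is a local commit made on top of b.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "x": ["b"]}

refs = {"master": "b", "upstream/master": "d"}
pull_ff_only(parents, refs, "master", "upstream/master")
print(refs["master"])   # d

diverged = {"master": "x", "upstream/master": "d"}
try:
    pull_ff_only(parents, diverged, "master", "upstream/master")
    outcome = "moved"
except RuntimeError:
    outcome = "refused"
print(outcome)          # refused
```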


> > How does one keep their fork up to date? For example, if I fork dmd,
> > and wait a month, do I just fetch using dmd's master as a remote,
> > and then rebase?

If you keep to the convention of never committing to master locally,
then you can just `git pull --ff-only upstream master` and it will pull
in the latest changes.  Then you just rebase your local branch(es) on
top of master.

In graph-centric terms, running `git rebase master` in a local branch B
does the following: (1) find the common ancestor A of master and B; (2)
for each node in B up to (but not including) A, create a corresponding
new node that contains the same changes, but is based on the tip of
master instead of A; (3) set B to point to the last of the new nodes.

Special note: since rebase isn't actually modifying nodes -- remember
nodes are immutable -- if you're unsure or want to be extra-careful, you
can keep a spare reference to the old tip of B before running the
rebase, like this:

        git branch B-backup B           # backup pointer to the current tip of B
        git checkout B                  # make sure B is the current branch

        git rebase master               # rebase B onto master

If you then run `git log --graph --all`, you'll see that there are now
*two* copies of the commits you made in B: one in the original position
branching off master at ancestor A, and the other is now based on
master.  'B' will now point to the new nodes, but you'll still be able
to access the old nodes via 'B-backup'.  If at any time you wish to
'undo' the rebase, just reset B to B-backup.  (The new nodes will then
become unreferenced, and will be garbage-collected. Unless you kept
another pointer to them, of course.)
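
The three rebase steps, and the two copies of B's commits you end up
with, can be sketched like this.  It's a toy model under heavy
assumptions: linear history only, "changes" are just labels, conflicts
don't exist, and the helpers `add()`, `ancestors()` and `rebase()` are
invented for the example.

```python
import itertools

parents = {}    # node id -> parent id (None for the root); linear only
changes = {}    # node id -> label for the change that node introduced
_counter = itertools.count(1)

def add(change, parent):
    """Create a new node; existing nodes are never modified."""
    node = "n%d" % next(_counter)
    parents[node] = parent
    changes[node] = change
    return node

def ancestors(node):
    """The node and all of its ancestors, nearest first."""
    out = []
    while node is not None:
        out.append(node)
        node = parents[node]
    return out

def rebase(refs, branch, onto):
    # (1) walk back from the branch tip until we hit the common ancestor A
    onto_ancestors = set(ancestors(refs[onto]))
    to_replay = []
    for node in ancestors(refs[branch]):
        if node in onto_ancestors:
            break                    # reached A; A itself is not replayed
        to_replay.append(node)
    # (2) recreate each node, oldest first, on top of the tip of `onto`
    tip = refs[onto]
    for node in reversed(to_replay):
        tip = add(changes[node], tip)
    # (3) move the branch pointer to the last of the new nodes
    refs[branch] = tip

a = add("A", None)
m1 = add("M1", a)     # master moved on after B branched off at A
b1 = add("B1", a)
b2 = add("B2", b1)
refs = {"master": m1, "B": b2, "B-backup": b2}

rebase(refs, "B", "master")
print([changes[n] for n in ancestors(refs["B"])])          # ['B2', 'B1', 'M1', 'A']
print([changes[n] for n in ancestors(refs["B-backup"])])   # ['B2', 'B1', 'A']
```

Undoing the rebase in this model is one assignment:
`refs["B"] = refs["B-backup"]`.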

See? No danger of data loss. (Unless you forget to keep a spare pointer
to your old nodes. But even in that case, there's still a way out with
`git reflog` -- git gc doesn't actually delete nodes until they're past
a certain age (with default settings, reflog entries stick around for at
least 30 days), so as long as you notice the problem early and not months
later, your old nodes will still be there. You just have to dig
through `git reflog` to find the old pointer values, i.e., SHA hashes.
Once you find the right SHA hash, just `git checkout <hash>` to go back
to the old node, then `git checkout -b <oldbranch>` to create a new
branch pointer to point to the old nodes.)
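
If it helps, think of the reflog as nothing more than a journal of
pointer updates.  A minimal sketch, assuming a single journal for all
refs (real git keeps one per ref plus one for HEAD, and `update_ref()`
here is a made-up helper):

```python
refs = {}
reflog = []     # (branch, old value, new value), oldest entry first

def update_ref(name, new):
    """Record the old value before moving the pointer."""
    reflog.append((name, refs.get(name), new))
    refs[name] = new

update_ref("B", "b1")
update_ref("B", "b2")            # original work on B
update_ref("B", "b2_rebased")    # a rebase moves B to the new copies

# Oops, no backup branch.  Dig the previous value out of the journal
# and attach a fresh pointer to it.
name, old, new = reflog[-1]
update_ref("B-restored", old)
print(refs["B-restored"])        # b2
```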


[...]
> > and do I need a separate branch for each pull request, or is the
> > pull request itself somehow isolated from my changes?
> 
> You *should* create a separate branch for each pull request unless
> you're a masochist. There's *no* isolation other than whatever
> isolation YOU create.  (Not my idea of award-winning software design,
> but meh, it is what it is).
> 
> This is why people are adamant about making a separate branch for each
> pull request. *Technically* speaking you don't absolutely HAVE
> to...But if you *don't* create a separate branch for each PR, you're
> just asking for pain: It'll be a PITA if you want to create another PR
> before your first one is approved and merged. And it'll be a PITA if
> your PR is rejected and you want to do any more work on the codebase.
[...]

Just think of it as updating a graph.  You have a local copy of the
graph, and you've added a bunch of new nodes to it.  Now you want the
upstream people to add your new nodes to their copies of the graph too.
Suppose further that these nodes represent several different changesets.
What's the best way to manage these nodes?

It should be obvious that the best way is to use a different pointer for
each changeset, so that if the upstream people decide to merge changeset
A but reject changeset B, you can keep your local copy of the graph
straight.  If you use the *same* pointer for all changesets, then it
should be no surprise when things become a big mess when upstream merges
some changesets but not others, yet locally you have no way of
addressing each changeset separately.

Even if all your changes eventually get merged, in the interim you may
be running git rebase to apply your changes to the latest upstream code;
if you only keep a single pointer around for everything, you're going to
lose track of what's going on really quickly.

There's no *requirement* that you do things this way, of course, but
it's just a matter of being able to keep your own changesets straight
when you have to reconcile your local graph with the remote one.


T

-- 
Never wrestle a pig. You both get covered in mud, and the pig likes it.
