Hi All

the recent discussions here gave me a much better picture of what kind
of features we should strive for when we add in-repo branches. But
perhaps I should not (yet) talk about branches. Let us talk instead
about "patch sequences".

The way Darcs works today is that it knows of exactly one sequence of
patches per repo. [As an aside: It is absolutely necessary for Darcs to
know in which sequence patches are applied, since what we internally
refer to are not patches in the abstract sense as the user sees them,
but rather their concrete representations. And --in contrast to abstract
patches-- these apply only in a much more narrowly defined context.]

Internally, this sequence is broken up into segments, where each segment
can refer to a parent segment (except the last one at the start of the
repo; the segments are internally called "inventories"). This
segmentation is done for efficiency, mostly, so from a user perspective
we can pretty much ignore it (with a few exceptions, see below).

It should perhaps be noted at this point that the sequence does not
/contain/ the patches, it merely references them. The patches themselves
are stored independently. Like a list of pointers/references to elements
in C or Java.

What happens when we "destroy" a patch, using any one of: amend,
obliterate, unrecord, or rebase suspend? Well, nothing is actually
deleted or destroyed, we merely forget about the existence of the
original patch, since we no longer have a reference to it.

But nothing forces us to "forget" these patches! Instead, each time we
do something "destructive", we could add a new reference to some
internal data structure that is very similar, if not identical, to that
which currently gives us the ordered sequence of patches in our repo.

In particular, since we already have the ability to refer to a parent
segment, all we need to do is to split the current segment and create
two new ones that both refer back to the old segment.

Here is a picture. We start out with the sequence

  A<B<---C<D<*

Here, A is the "oldest" patch and D is the "latest" one; the * denotes
the head of the sequence, which at some point in the past has been split
into A<B and C<D. The arrow point backwards here, like list pointers.

Now let's amend patch D, giving us a new patch E. Today, what we get is

  A<B<---C<E<*

  D

where D is decoupled from our sequence and can no longer be referred to;
it might as well not exist (of course, some other repo may still have a
reachable copy). What I propose to do is change this to

  A<B<---C<---E<*
         ^
          \
           ---D<x

where the small letter 'x' denotes a new "head" (of another sequence). I
come to the problem of naming this sequence/head later. For now, let us
assume we have some name ("x") for it and see what properties this has.
The old patch D is obviously not "applied" in our repo, only the
sequence of patches starting at * is.

Now, as we all know and love, Darcs can --sometimes-- re-order the
sequence of patches. Let us see where this leads us. For instance,
assume we want to get rid of a patch in the past, say C. To do this,
Darcs must commute this patch to the head (*). Assume this suceeds, i.e.
E does not depend on C. We get two new patches E'<C', where E and E' as
well as C and C' both represent the same "abstract patch".

  A<B<---E'<---C'<*

We can now drop C' and get

  A<B<---E'<*
         ^
          \
           ---C'<y

But this picture is incomplete as yet. What about D? We must decide to
which parent sequence we attach it, since its original parent sequence
no longer exists.

We face a similar situation today when we want to do a destructive
operation with a patch behind a tag. If we insist on doing it, we must
destroy the tag, too, since it depends on everything that comes before
it. This is why Darcs chooses to start a new segment whenever it sees a
"clean" tag at the top ("clean" means that the tag actually depends on
all patches preceding it): re-ordering patches forward across a tag is
impossible.

In our situation things are not as inflexible, though. Remember what we
did when commuting C<E: we created two new patches. The old ones are
still there and while we cannot refer to them in our main/current
sequence, we /can/ reference them in another sequence. So this is what
we could do:

           ---C'<y
          /
         v
  A<B<---E'<*
    ^
     \
      ---C<D<x

Looks nice, doesn't it? Note how the sequence of patches at y remains
exactly the same, only the segmentation has changed. We need to do only
a minimal amount of work.

Still the above solution has its problems and they have to do with naming.

For a satisfying UX we need to have a way to refer, in a unique manner,
to a head such as the "x" and "y" I have been using. We could of course
create /some/ unique ID, each time we create a branch.

I think this is just too awkward in practice: if I have two identical
clones of a project and obliterate the same patch in both repos
independently, then I get two differently named branches. If I pull
patches between them with some extra option that also pulls
branches/heads, I now have two identical but differently named branches
in a single repo. This is more or less what git users have to cope with.
Git has features on top of that: you can tell it that local branch "x"
tracks remote branch "y", that is, they become (locally) identified (or
something like that). I would very much like to avoid this, it is
difficult to understand and requires too much initial setup and maintenance.

Can we solve this without inventing new names?

What could work, in principle, is to let Darcs list our sequences one by
one and we choose whichever we want to refer to interactively. But
having to do that every time would be even more awkward than the git
solution.

One obvious idea is to name branches by referring to their top-most
patch (hash). This requires that no two branches have the same patch on
top so we can uniquely refer to the head by referring to this patch. Is
there a way to ensure that?

The scheme as presented so far does not have this property. For
instance, in the picture above obliterate D in head x, then x starts
with C and y with C', both of which are "identical" as far as the user
is concerned, that is, they have the same hash. We could even arrive at
situations where the whole patch sequence is identical.

Let us review the earlier example. We have

  A<B<---C<---E<*
         ^
          \
           ---D<x

and want to commute C<E, so we can obliterate C. Suppose C also commutes
with D, say D''<C''. In this case we can just drop C'' from x and arrive at

      ---D''<x
     /
    v
  A<B<---E'<*
         ^
          \
           ---C'<y

and our invariant holds. But what if we cannot commute C<D, that is, D
depends on C?

Well, it turns out there is yet another alternative that is even more
elegant: We throw away the C' and instead create a new branch for C
exactly where it was before:

      ---C<D<x
     /
    v
  A<B<---E'<*
    ^
     \
      ---C<y

and now we can join the common "tail" of "x" and "y":

  A<B<---E'<*
    ^
     \
      ---C<y
         ^
          \
           ---D''<x

This works because for the purpose of keeping a reference to a patch it
doesn't matter which representative we chose. If we obliterate the same
patch in two clones of a repo, and then pull from one to the other, we
will still get only one branch!

I would like to point you now to a paper containing a proof that we can
always choose to branch in such a way that if a patch is contained in
two different branches, then it can be commuted back into the common
"tail" of both. Unfortunately this paper does not exist yet.

However, the good news is that we can be pretty sure the theorem holds.
We have a test-suite where properties like this are tested using
QuickCheck. And we are /already/ relying on it because without this
fundamental property operations such as pull, push, or apply would not
work at all.

Conclusion
----------

If we choose to add in-repo branches to Darcs, we do /not/ need to be
concerned with branch identifiers! It is enough to able to name patches
to uniquely identify a branch. It doesn't even have to be the top-most
patch, just any patch in the initial segment of a branch. I think this
will result in a UI that is at least as easy to use as that of Mercurial
and still consistent and efficient as with git.

We can and should allow to attach symbolic "human readable" names to
branches, but they will have no fundamental relevance. It's a bit like
using 'darcs pull -p patchname', you can use it and it is convenient but
if in doubt there is always the hash to uniquely identify a single
patch. A spontaneous idea I just had is this: if we add a feature to
give symbolic names ("aliases") to patches, then these could double as
branch names. We could maintain these associations in a simple text file
with one line per alias and add a pref 'aliasfile' so we can keep the
file under version control, quite similar to 'boringfile'.

I'll write more about what I think a good UI to in-repo branches could
be in some subsequent message.

Cheers
Ben

_______________________________________________
darcs-users mailing list
darcs-users@osuosl.org
https://lists.osuosl.org/mailman/listinfo/darcs-users

Reply via email to