Git forking for fun and profit

Assaf Arkin Wed, 30 Apr 2008 18:21:52 -0700

= On The Morality of Forking

One thing I love about open source is that it gives you the right to fork
[1].  Don't like how the project is managed?  Want to take it in a different
direction?  Tired of seeing a broken trunk and re-fixing the same typos?
Copy it over and start a new project.  The Apache project started life that
way [2].


In open source culture, forking is often used as a four-letter word.  One
fork means two different code bases.  What happens next depends on the tools
you use, but typically keeping these two forks synchronized, sharing changes
and bug fixes, can be a pretty daunting if not impossible task.  Even
branching in the same code repository is a tricky maneuver -- when was the
last time you did an SVN branch only to fix a one-line bug?

There's a high bar to forking, so people don't do it lightly.  We generally
prefer not to, reserving forking for dead projects and irreconcilable
artistic differences.  Being forked is a stigma you don't want on your
project's resume.

At least it used to be that way.  Back in the days of ugly source control
systems [3], forking would lead to all sorts of nasty side effects.  So I
want to correct that impression and explore a different kind of forking --
one that's fun, healthy and a way to build a better community around the
forked project.


= Forking Alone

The way SVN works, and I assume you're familiar with SVN, you check the code
out of a central repository and into your working copy.  You make changes
locally, and when you're done, you commit those back to the central
repository (or hand someone a patch to commit on your behalf).

The working directory is an offline copy of "the real thing", a local cache
for easy editing.  Officially we're all working on the same code base;
whatever edits you make in the privacy of your own computer are your
business (until you commit).

Distributed version control works in a different way.  I'm going to talk
exclusively about Git for reasons that will become obvious later [4][5], but
the same goes for whatever distributed version control you decide to use.

When your source control is distributed, there's really nothing else to do
all day but fork.  To get anything meaningful to happen you start by cloning
the central repository to your hard disk.  What you end up with is a full
blown copy, history and all.  Clone, branch, or whatever else you call it,
it really is a fork.

Now that you're working with your very own fork, you can branch, commit,
rollback, merge and do all sorts of interesting things on your own
repository.  You can also fetch the latest changes from the central
repository, and push the work you've done back to the central repository, or
send a patch that someone else can pull in.
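
To make that concrete, here is a minimal sketch of the private-fork cycle.
Everything in it is a stand-in: central.git is a throwaway bare repository
on local disk playing the part of the central repository, and the identity
and file names are made up so the script runs in an empty sandbox.

```shell
#!/bin/sh
set -e

# Hypothetical identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
cd "$work"

# Stand-in for the central repository (assumption: a local bare repo).
git init -q --bare central.git

# Clone it: this is your private fork, history and all.
git clone -q central.git fork 2>/dev/null
cd fork
trunk=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

echo "hello" > README
git add README
git commit -q -m "Add README"

# Branch, commit and merge locally; central.git is not touched yet.
git checkout -q -b experiment
echo "more" >> README
git commit -q -am "Try something"
git checkout -q "$trunk"
git merge -q experiment

# Only now publish the result to the central repository.
git push -q origin "$trunk"

fork_head=$(git rev-parse HEAD)
central_head=$(git --git-dir=../central.git rev-parse "$trunk")
echo "fork:    $fork_head"
echo "central: $central_head"
```

Both lines should print the same commit id: until the push, all the history
lived only in the fork.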

Forking in private is most of what you do, but not all.


= Forking In Public

Open source is great, open source with open development is teh awesome.
Open development is done in public.  You don't go hiding in a cave only to
emerge a few months later with a big code drop.  Do the work where everybody
else gets to check it out, participate and hopefully contribute.

We don't look too kindly on anti-social behavior.  On the other hand, tools
like Git are great for cave digging and dwelling.  Why would I think forking
is such a good idea?

To begin with, social problems are not solved with technology.  The point of
a source control system is to make development easier, not annoy people into
socializing.  That should come from a fun, creative and supportive
community.

Git is wonderful for committers.  Rule #1 of source control: don't break the
trunk!  When you break the trunk, everybody else has a bad day.  They can't
get any work done.

But during development you often reach a point where you've got something
incomplete, perhaps broken, but significant enough that you'll want to check
it in.  You want that checkpoint because it allows you to move forward and
experiment with different ideas.  Worst case, you can always roll back.  The
ability to take these checkpoints, making local commits and branches
without breaking trunk, is quite powerful.  Use it wisely.

Git is also wonderful for those of you who are not committers (yet).  You
can get to be a committer by racking up karma points.  You get more points
from major contributions.  Major contributions require a lot of work: you'll
want to keep that work under source control, and you'll want to involve
other developers so they can help make it happen.  With SVN you can't do
that until you get your committer status.  Catch-22.

So fork in public.  You do that by setting up a public repository,
synchronizing it with the central repository, and pushing your changes to
your public repository.  Other developers can then clone your repository,
check the code you're working on, use it and test it, send you patches and
even push changes to your repository, working together away from the trunk.

When you're done working on a big enough change you can merge all these
changes into a set of patches and send them over for inclusion in the
central repository.  That way you can contribute as little or as much as you
like without waiting for SVN access.  Even better, you can share these
features with others while waiting for them to be included upstream.
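
Here is a rough sketch of that public-fork round trip, using local bare
repositories as stand-ins.  The names central.git, public.git, the remote
name "public" and the branch big-feature are all made up for illustration;
in real life the public repository would live on a host others can reach.

```shell
#!/bin/sh
set -e

# Hypothetical identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
cd "$work"

# Stand-ins: central.git is the project's repository, public.git your fork.
git init -q --bare central.git
git init -q --bare public.git

# Seed the central repository with one commit.
git clone -q central.git seed 2>/dev/null
cd seed
trunk=$(git symbolic-ref --short HEAD)
echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Initial import"
git push -q origin "$trunk"
cd ..

# Your local clone tracks central as "origin" and your fork as "public".
git clone -q central.git mine
cd mine
git remote add public "$work/public.git"
git checkout -q -b big-feature
echo "feature work" >> notes.txt
git commit -q -am "Start big feature"
git push -q public big-feature
cd ..

# A collaborator clones the public fork and builds on your branch.
git clone -q -b big-feature public.git theirs
cd theirs
echo "a fix from a collaborator" >> notes.txt
git commit -q -am "Fix big feature"
git push -q origin big-feature
cd ..

# You pick up their commit from the public fork.
cd mine
git pull -q --ff-only public big-feature
feature_commits=$(git rev-list --count "origin/$trunk..big-feature")
echo "commits ahead of trunk: $feature_commits"
```

The central repository never sees any of this until you decide to send the
work upstream.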


= Forking is Fun

So let's review some of the things you get out of using Git.

You can branch, commit, go back in history and do all sorts of useful things
offline.  Offline means you can do them on a plane or a train, which some
people think is really cool.  Even if you're always connected, you'll love
how everything happens so damn fast.  It's like strapping a T3 line straight
into the ethernet port.

You don't have to bother anyone else.  You can branch as often as you need
to, which is damn useful when you're working on two things at the same time.
(Yes, SVN has had branches since forever.  Not the same thing.)  Actually, I
recommend branching any time you're working on something: branch, change,
commit, merge.  Did I mention, fast?

You don't have to hold off on commits until everything works.  You can write
test cases and commit, get some code working against these tests and commit,
get more code working and commit again.  Only when you're finally ready with
that changeset do you push it into the central repository, by which point
you won't be breaking any trunk.  Frequent commits are great if you like to
experiment with new ideas, or share work in progress.

If you're doing something big, you can fork the central repository and get
other people working against that fork, helping to make it happen.  You
don't need to be a committer to start off on a major contribution, and you
don't have to wait for a patch inclusion before others can start using your
code.  Best, you don't need to bother with SVN branches, which ironically
are harder to synchronize with trunk than a Git fork.

Come to think of it, just giving all that power to contribute to developers
who are not yet committers is a killer feature, and why I'm writing this
piece to begin with.

I realize all this fun stuff might be hard to imagine, might not even sound
plausible, if you're used to the SVN way of doing things, but once you let
go of centralized source control, everything in the universe will start
making sense.


= Parenting and Custody

I started by saying open source provides the right to fork.  Each open
source license expresses that right in a different way, the one we use is
the Apache Software License.  That works as long as we provide all the code
under the terms of the ASL, which we can do since every contribution
included an agreement to let Apache distribute it under the ASL.

To avoid all sorts of nasty custody fights, which we really don't have time
to deal with (we're trying to get software done), we have to make sure all
the code coming in is accompanied by the Contributor License Agreement [6].
We have two ways of doing that.

Some code comes directly from committers, all of whom signed the CLA, easy.
Other code comes from patches, which go through JIRA.  JIRA gives you the
option to CLA the patch before uploading it, telling committers they can go
right ahead and add it to the code base.  That way we have a commit trail
showing who contributed what.

So you understand why the official source repository has to be hosted by
Apache, and while we're waiting for Git to happen, right now we're stuck
with SVN.  No Git for us?  Turns out, it's not such a big problem.

For starters, you can always use git-svn [4] to clone the SVN repository and
then use Git instead of SVN.  You get all the awesomeness of Git and an easy
way to keep it consistent with trunk.  I'll explain how to do this when
we talk about the mechanics of forking.

If you find someone you trust who's already managing a Git clone that's
synchronized with SVN, you can clone their Git repository.  I use Victor's
Git repository [7].

If you're working on something big, you'll probably want to fork in public,
creating a remote repository that others can tap into.  Lots of options to
choose from, the one I use, because it's wonderful and I don't want to host
one myself, is Github [5].  If you want to clone someone else's repository
to create your own public repository, you'll love the "fork" button.  Can't
miss it.

Now, all of this introduces an interesting problem.  Say you decide to work
on something big enough that you need a public repository.  You also need
other people to help you, by contributing their changes and fixes.  You want
to bring all that code into the central repository so this cool feature
shows up in all future releases, but the code is now a mishmash of
contributions from different people.  How would you get something like that
approved?

Here are three things you can do to help us approve these contributions:

1.  If in doubt, ask.  Mailing list is the best place.  We'll revise these
guidelines as we learn what works best.

2.  Keep an ongoing commit trail.  If you accept a patch from someone else,
commit it and include an attribution in the commit message (see [8] for
guidance).  When you use git format-patch, it creates one patch for each
commit, and we apply these individually, preserving the commit trail.
Pushing does the same (but double check).

3.  Ask contributors to sign the CLA [9].  It's quick, it's easy and you
don't have to be a committer.  Check the list of committers and
non-committers who already signed the CLA [10].
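
As a sketch of what guideline 2 can look like in practice, the script below
records someone else's change with an attribution, exports one patch per
commit, and applies the patches one at a time on the receiving side.  The
repositories, names and e-mail addresses are invented, and the "Submitted
by:" line and the use of --author are my assumptions about a reasonable
attribution; check [8] for what the project actually expects.

```shell
#!/bin/sh
set -e

# Hypothetical maintainer identity for this sandbox.
export GIT_AUTHOR_NAME="Example Maintainer" GIT_AUTHOR_EMAIL="me@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
cd "$work"

# Stand-in for the repository that will receive the patches.
git init -q upstream
cd upstream
trunk=$(git symbolic-ref --short HEAD)
echo "base" > file.txt
git add file.txt
git commit -q -m "Base"
cd ..

git clone -q upstream fork
cd fork
git checkout -q -b feature

# Your own change.
echo "mine" >> file.txt
git commit -q -am "Add my change"

# A change you received from someone else: credit them in the commit.
echo "theirs" >> file.txt
git commit -q -a \
  --author="Jane Contributor <jane@example.invalid>" \
  -m "Apply contributed fix" \
  -m "Submitted by: Jane Contributor"

# One patch file per commit since trunk.
git format-patch -q -o "$work/patches" "origin/$trunk"
cd ..

# Receiving side: apply each patch as its own commit, keeping the trail.
cd upstream
git am -q "$work"/patches/*.patch
git log --format='%an: %s'
```

The log on the receiving side still shows who wrote what, which is the
whole point of keeping the trail.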


= The Mechanics of Forking

So let's discuss the few ways in which you can fork, starting with git-svn:

$ git svn clone http://svn.apache.org/repos/asf/incubator/buildr/trunk \
    buildr -r <revision>

Apache maintains one huge repository shared by all projects, and while Git
will only clone the history for a given project, it will need some time to
process through an endless stream of revisions.  How long?  Really long.
Most likely you don't need all that history going back to the very first
day, so just clone from a recent revision.  That only takes a couple of
minutes.

Since all projects share the same repository, svn info will show you two
revision numbers.  The first, the actual SVN revision number, is not the one
you want -- cloning it will fetch nothing.  The second, the "Last Changed
Rev" is the one you want to clone from.
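
If you'd rather not pick the number out by eye, a small helper can do it.
This is only a sketch: last_changed_rev is a name I made up, and the sample
text and its revision numbers are invented so the example runs without
network access.  Against the real repository you would pipe the output of
svn info for the trunk URL into the same function.

```shell
#!/bin/sh

# Print the "Last Changed Rev" value from `svn info` output read on stdin.
last_changed_rev() {
  awk -F': ' '$1 == "Last Changed Rev" { print $2 }'
}

# Sample output for illustration only; the revision numbers are made up.
sample='Path: .
URL: http://svn.apache.org/repos/asf/incubator/buildr/trunk
Revision: 652000
Last Changed Rev: 651500'

rev=$(printf '%s\n' "$sample" | last_changed_rev)
echo "clone from r$rev"
```

The value printed is the one to pass to -r when cloning.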

Check that it worked:

$ cd buildr
$ ls
.....
$ git svn info
Path: .
URL: http://svn.apache.org/repos/asf/incubator/buildr/trunk
.....

You'll want to set your name/e-mail so they show up in all commits, which
you can do on each Git repository, or once using the --global option:

$ git config --global user.name Assaf
$ git config --global user.email [EMAIL PROTECTED]

To pull updates from SVN and fix (rebase) all your local commits against the
most recent SVN update:

$ git svn rebase

Best way to work on a new feature is to start with a new branch:

$ git checkout -b teh-awesome
$ git branch
  master
* teh-awesome

Do some work, commit as often as necessary and when you're done, rebase
these commits against the latest changes from SVN, and generate some
patches:

$ git svn rebase
$ git format-patch origin

You'll get one patch file per commit.  Depending on what you did to get
here, that could be a boatload of patches, so you might want to roll
together (squash) some commits, or even change their order (that way, we'll
think you wrote test cases ahead of the code!).  Check the documentation;
git rebase -i master is your friend.
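
Normally git rebase -i opens an editor where you change "pick" to "squash"
by hand.  The sketch below scripts that same edit so it can run unattended:
it points GIT_SEQUENCE_EDITOR at sed to rewrite the to-do list and sets
GIT_EDITOR to true to accept the combined commit message.  It assumes GNU
sed (for -i) and uses a throwaway repository with made-up commits.

```shell
#!/bin/sh
set -e

# Hypothetical identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
# Accept whatever commit message Git proposes for the squashed commit.
export GIT_EDITOR=true

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
trunk=$(git symbolic-ref --short HEAD)
echo "base" > f.txt
git add f.txt
git commit -q -m "Base"

# Three small work-in-progress commits on a feature branch.
git checkout -q -b teh-awesome
for step in tests code cleanup; do
  echo "$step" >> f.txt
  git commit -q -am "WIP: $step"
done

# Keep the first "pick" and turn every later one into "squash".
GIT_SEQUENCE_EDITOR="sed -i -e '2,\$s/^pick/squash/'" \
  git rebase -i "$trunk" >/dev/null 2>&1

ahead=$(git rev-list --count "$trunk..HEAD")
echo "commits ahead of $trunk: $ahead"
```

After the rebase the branch carries a single commit, so git format-patch
produces a single patch.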

Cloning someone else's repository is just as easy, for example:

$ git clone git://github.com/vic/buildr.git
$ cd buildr

This time around you're working against a Git remote repository, so you grab
updates using git fetch/pull and rebase accordingly [11].  Everything else
involving branching and patching works the same way.

You can also work with both at the same time: a local repository that
clones a remote repository, the one you're using to share your work with
others, and also synchronizes with SVN trunk.  (Bet you didn't know, but
your local repository can synchronize with several remote repositories.)

You'll want to start with git svn clone and then add the remote repository
using git remote add.  Or just use the buildr-git script [12], courtesy of
Victor, which sets up everything to work just right, and adds useful
commands like git apache-fetch, git apache-pull and git synchronize.

There are a few command line options you can set to use this script with any
other Apache project.

So go, have fun, and Git away!



[1] http://en.wikipedia.org/wiki/Fork_(software_development)
[2] http://httpd.apache.org/ABOUT_APACHE.html
[3] http://www.youtube.com/watch?v=4XpnKHJAok8
[4] http://utsl.gen.nz/talks/git-svn/intro.html
[5] http://github.com/
[6] http://apache.org/licenses/icla.txt
[7] http://github.com/vic/buildr/tree/master
[8] http://www.apache.org/dev/committers.html#applying-patches
[9] http://apache.org/licenses/#clas
[10] http://people.apache.org/~jim/committers.html
[11] http://git.or.cz/gitwiki/GitFaq#head-1168c3027a2b7060df8c5cf141756c8e0e33139c
[12] http://github.com/vic/buildr/tree/master/doc/scripts/buildr-git.rb
