= On The Morality of Forking

One thing I love about open source is that it gives you the right to fork [1]. Don't like how the project is managed? Want to take it in a different direction? Tired of seeing a broken trunk and re-fixing the same typos? Copy it over and start a new project. The Apache project started life that way [2].

In open source culture, forking is often treated like a four-letter word. One fork means two different code bases. What happens next depends on the tools you use, but typically keeping these two forks synchronized, sharing changes and bug fixes, can be a daunting, if not impossible, task. Even branching in the same code repository is a tricky maneuver -- when was the last time you did an SVN branch only to fix a one-line bug?

There's a high bar to forking, so people don't do it lightly. We generally prefer not to, reserving forking for dead projects and irreconcilable artistic differences. Being forked is a stigma you don't want on your project's resume.

At least it used to be that way. Back in the days of ugly source control systems [3], forking would lead to all sorts of nasty side effects. So I want to correct that impression and explore a different kind of forking -- one that's fun, healthy and a way to build a better community around the forked project.

= Forking Alone

The way SVN works, and I assume you're familiar with SVN, you check the code out of a central repository and into your working copy. You make changes locally, and when you're done, you commit those back to the central repository (or hand someone a patch to commit on your behalf). The working directory is an offline copy of "the real thing", a local cache for easy editing. Officially we're all working on the same code base; whatever edits you make in the privacy of your own computer are your business (until you commit).

Distributed version control works in a different way. I'm going to talk exclusively about Git for reasons that will become obvious later [4][5], but the same goes for whatever distributed version control you decide to use.

When your source control is distributed, there's really nothing else to do all day but fork. To get anything meaningful to happen you start by cloning the central repository to your hard disk. What you end up with is a full-blown copy, history and all.
Clone, branch, or whatever else you call it, it really is a fork.

Now that you're working with your very own fork, you can branch, commit, roll back, merge and do all sorts of interesting things on your own repository. You can also fetch the latest changes from the central repository, and push the work you've done back to the central repository, or send a patch that someone else can pull in.

Forking in private is most of what you do, but not all.

= Forking In Public

Open source is great, open source with open development is teh awesome. Open development is done in public. You don't go hiding in a cave only to emerge a few months later with a big code drop. Do the work where everybody else gets to check it out, participate and hopefully contribute. We don't look too kindly on anti-social behavior.

On the other hand, tools like Git are great for cave digging and dwelling. Why would I think forking is such a good idea?

To begin with, social problems are not solved with technology. The point of a source control system is to make development easier, not annoy people into socializing. That should come from a fun, creative and supportive community.

Git is wonderful for committers. Rule #1 of source control: don't break the trunk! When you break the trunk, everybody else has a bad day. They can't get any work done. But during development you often reach a point where you've got something incomplete, perhaps broken, but significant enough that you'll want to check it in. You want that checkpoint because it allows you to move forward and experiment with different ideas. Worst case, you can always roll back. The ability to take these checkpoints, making local commits and branches without breaking trunk, is quite powerful. Use it wisely.

Git is also wonderful for those of you who are not committers (yet). You can get to be a committer by racking up karma points. You get more points from major contributions.
Major contributions require a lot of work: you'll want to keep that work under source control, and you'll want to involve other developers so they can help make it happen. With SVN you can't do that until you get your committer status. Catch-22.

So fork in public. You do that by setting up a public repository, synchronizing it with the central repository, and pushing your changes to your public repository. Other developers can then clone your repository, check the code you're working on, use it and test it, send you patches and even push changes to your repository, working together away from the trunk.

When you're done working on a big enough change you can merge all these changes into a set of patches and send them over for inclusion in the central repository. That way you can contribute as little or as much as you like without waiting for SVN access. Even better, you can share these features with others while waiting for them to be included upstream.

= Forking is Fun

So let's review some of the things you get out of using Git.

You can branch, commit, go back in history and do all sorts of useful things offline. Offline means you can do them on a plane or a train, which some people think is really cool. Even if you're always connected, you'll love how everything happens so damn fast. It's like strapping a T3 line straight into the ethernet port.

You don't have to bother anyone else. You can branch as often as you need to, which is damn useful when you're working on two things at the same time. (Yes, SVN has had branches since forever. Not the same thing.) Actually, I recommend branching any time you're working on something: branch, change, commit, merge. Did I mention, fast?

You don't have to hold off on commits until everything works. You can write test cases and commit, get some code working against these tests and commit, get more code working and commit again.
When you're finally ready with that changeset, you push it into the central repository, by which point you won't be breaking any trunk. Frequent commits are great if you like to experiment with new ideas, or share work in progress.

If you're doing something big, you can fork the central repository and get other people working against that fork, helping to make it happen. You don't need to be a committer to start off on a major contribution, and you don't have to wait for a patch to be included before others can start using your code. Best of all, you don't need to bother with SVN branches, which ironically are harder to synchronize with trunk than a Git fork is.

Come to think of it, just giving all that power to contribute to developers who are not yet committers is a killer feature, and why I'm writing this piece to begin with.

I realize all this fun stuff might be hard to imagine, might not even sound plausible, if you're used to the SVN way of doing things, but once you let go of centralized source control, everything in the universe will start making sense.

= Parenting and Custody

I started by saying open source provides the right to fork. Each open source license expresses that right in a different way; the one we use is the Apache Software License. That works as long as we provide all the code under the terms of the ASL, which we can do since every contribution includes an agreement to let Apache distribute it under the ASL.

To avoid all sorts of nasty custody fights (we really don't have time to deal with those, we're trying to get software done), we have to make sure all the code coming in is accompanied by the Contributor License Agreement [6].

We have two ways of doing that. Some code comes directly from committers, all of whom have signed the CLA. Easy. Other code comes from patches, which go through JIRA. JIRA gives you the option to CLA the patch before uploading it, telling committers they can go right ahead and add it to the code base.
That way we have a commit trail showing who contributed what.

So you understand why the official source repository has to be hosted by Apache, and while we're waiting for Git to happen, right now we're stuck with SVN. No Git for us? Turns out, it's not such a big problem.

For starters, you can always use git-svn [4] to clone the SVN repository and then use Git instead of SVN. You get all the awesomeness of Git and an easy way to keep it consistent with trunk. I'll explain how to do this when we talk about the mechanics of forking.

If you find someone you trust who's already managing a Git clone that's synchronized with SVN, you can clone their Git repository. I use Victor's Git repository [7].

If you're working on something big, you'll probably want to fork in public, creating a remote repository that others can tap into. There are lots of options to choose from; the one I use, because it's wonderful and I don't want to host one myself, is Github [5]. If you want to clone someone else's repository to create your own public repository, you'll love the "fork" button. Can't miss it.

Now, all of this introduces an interesting problem. Say you decide to work on something big enough that you need a public repository. You also need other people to help you, by contributing their changes and fixes. You want to bring all that code into the central repository so this cool feature shows up in all future releases, but the code is now a mishmash of contributions from different people. How would you get something like that approved?

Here are three things you can do to help us approve these contributions:

1. If in doubt, ask. The mailing list is the best place. We'll revise these guidelines as we learn what works best.
2. Keep an ongoing commit trail. If you accept a patch from someone else, commit it and include an attribution in the commit message (see [8] for guidance).
When you use git format-patch, it creates one patch for each commit; we apply these individually, preserving the commit trail. Pushing does the same (but double check).
3. Ask contributors to sign the CLA [9]. It's quick, it's easy and you don't have to be a committer. Check the list of committers and non-committers who already signed the CLA [10].

= The Mechanics of Forking

So let's discuss the few ways in which you can fork, starting with git-svn:

  $ git svn clone http://svn.apache.org/repos/asf/incubator/buildr/trunk buildr -r <revision>

Apache maintains one huge repository shared by all projects, and while Git will only clone the history for a given project, it will need some time to process an endless stream of revisions. How long? Really long. Most likely you don't need all that history going back to the very first day, so just clone from a recent revision; it will only take a couple of minutes.

Since all projects share the same repository, svn info will show you two revision numbers. The first, the actual SVN revision number, is not the one you want -- cloning it will fetch nothing. The second, the "Last Changed Rev", is the one you want to clone from.

Check that it worked:

  $ cd buildr
  $ ls
  .....
  $ git svn info
  Path: .
  URL: http://svn.apache.org/repos/asf/incubator/buildr/trunk
  .....

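If you'd rather not pick the revision out of svn info by eye, here is a minimal sketch of how you might pull out the "Last Changed Rev" and hand it to -r. The helper name last_changed_rev and the sample revision numbers are made up for illustration; the real svn and git-svn calls are left as comments because they need network access.

```shell
# Print the "Last Changed Rev" value from `svn info` output read on stdin.
# (last_changed_rev is a hypothetical helper name, not part of svn or git.)
last_changed_rev() {
  awk -F': ' '/^Last Changed Rev:/ { print $2 }'
}

# Abbreviated sample of `svn info` output; the revision numbers are invented.
sample_info='Path: trunk
URL: http://svn.apache.org/repos/asf/incubator/buildr/trunk
Revision: 600000
Last Changed Rev: 599000'

rev=$(printf '%s\n' "$sample_info" | last_changed_rev)
echo "would clone starting from r$rev"

# Against the real repository (needs network access, svn and git-svn):
#   url=http://svn.apache.org/repos/asf/incubator/buildr/trunk
#   rev=$(svn info "$url" | last_changed_rev)
#   git svn clone "$url" buildr -r "$rev"
```

The parsing is kept in its own function so you can try it on pasted output before pointing it at the live repository.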
You'll want to set your name/e-mail so they show up in all commits, which you can do on each Git repository, or once using the --global option:

  $ git config --global user.name Assaf
  $ git config --global user.email [EMAIL PROTECTED]

To pull updates from SVN and fix (rebase) all your local commits against the most recent SVN update:

  $ git svn rebase

The best way to work on a new feature is to start with a new branch:

  $ git checkout -b teh-awesome
  $ git branch
    master
  * teh-awesome

Do some work, commit as often as necessary and when you're done, rebase these commits against the latest changes from SVN, and generate some patches:

  $ git svn rebase
  $ git format-patch origin

You'll get one patch file per commit. Depending on what you did to get here, that could be a boatload of patches, so you might want to roll together (squash) some commits, or even change their order (that way, we'll think you wrote test cases ahead of the code!). Check the documentation; git rebase -i master is your friend.

Cloning someone else's repository is just as easy, for example:

  $ git clone git://github.com/vic/buildr.git
  $ cd buildr

This time around you're working against a Git remote repository, so you grab updates using git fetch/pull and rebase accordingly [11]. Everything else involving branching and patching works the same way.

You can also work with both at the same time: a local repository that clones a remote repository (the one you're using to share your work with others) and also synchronizes with SVN trunk. (Bet you didn't know, but your local repository can synchronize with several remote repositories.)

You'll want to start with git svn clone and then add the remote repository using git remote add. Or just use the buildr-git script [12], courtesy of Victor, which sets up everything to work just right, and adds useful commands like git apache-fetch, git apache-pull and git synchronize. There are a few command-line options you can set to use this script with any other Apache project.

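To make the "several remote repositories" point concrete, here is a small sketch you can run offline. It is not the buildr-git script and it doesn't touch SVN: two throwaway bare repositories stand in for the places you'd really push to, and the remote names public and upstream, the temp directory and the identity are all made up for illustration. In real life the local repository would come from git svn clone and the remote would be a URL on your hosting service.

```shell
set -e

# Scratch area; everything below lives in a temporary directory.
work=$(mktemp -d)

# Two bare repositories standing in for remote repositories (hypothetical).
git init -q --bare "$work/public.git"
git init -q --bare "$work/upstream.git"

# The local repository. In practice this would be your git svn clone.
git init -q "$work/local"
cd "$work/local"
git config user.name "Example"
git config user.email "example@example.invalid"

# One local repository, two remotes.
git remote add public "$work/public.git"
git remote add upstream "$work/upstream.git"

# Work on a feature branch and commit.
git checkout -q -b teh-awesome
echo "hello" > README
git add README
git commit -q -m "Add README"

# Share the branch through the public stand-in only.
git push -q public teh-awesome

# List the configured remotes with their fetch and push locations.
git remote -v

# Remove the scratch area with: rm -rf "$work"
```

The branch ends up in the public stand-in while upstream stays untouched, which is the same separation you get between your public fork and trunk.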
So go, have fun, and Git away!

[1] http://en.wikipedia.org/wiki/Fork_(software_development)
[2] http://httpd.apache.org/ABOUT_APACHE.html
[3] http://www.youtube.com/watch?v=4XpnKHJAok8
[4] http://utsl.gen.nz/talks/git-svn/intro.html
[5] http://github.com/
[6] http://apache.org/licenses/icla.txt
[7] http://github.com/vic/buildr/tree/master
[8] http://www.apache.org/dev/committers.html#applying-patches
[9] http://apache.org/licenses/#clas
[10] http://people.apache.org/~jim/committers.html
[11] http://git.or.cz/gitwiki/GitFaq#head-1168c3027a2b7060df8c5cf141756c8e0e33139c
[12] http://github.com/vic/buildr/tree/master/doc/scripts/buildr-git.rb