I just wrote up a summary of Mercurial and distributed version control systems for a coworker. I figured I'd add it to the discussion in case it helps get some of the main ideas of these systems across...
-Damon

-----

Mercurial (hg) is one of a new breed of "distributed" version control systems, which I think have a lot of advantages over traditional systems that have a single copy of the repository. I am using it for my own personal software these days and highly recommend it. If you've heard of or used svk with svn, then you may be familiar with a few of the ideas involved.

*Basic Ideas of Distributed Version Control*

Here are some points that should give a flavor for what distributed version control and hg are all about:

- It turns out that with the right representation of a repository on disk, you can usually store the entire history of a project in not much more space than it takes to store the current snapshot of the project. And in some cases, you can store the entire history in less space. (Also note that the size of compiled libraries and executables often dwarfs the size of the source + history.)

- Given this, it makes sense for each developer to have a copy of the entire repository on their local hard drive. Then they can do a series of commits on their laptop hard drive without waiting for any kind of network activity (or even while they're on a plane) and then push the changes onto the "master" repository at a later date without losing all the incremental local revisions. Besides mobile/disconnected development, the major benefit of a local repository is that it makes most operations effectively instantaneous. Even some of the slowest operations, such as large (entire-project) diffs, become quick.

- You take the hit for downloading the repository once, up front. From then on, compressed diffs (just like those used in the repository itself) are sent over the network. The diffs for a revision also come with a cryptographic hash so that when they are expanded you can be sure they're not corrupted (more on that later).

- There is a separation/independence between the most recent revision of the project that your repository knows about and the state of the source files on your disk. You can make your source reflect a previous state of the project and then jump forward to the most current state without connecting to the master repository. This separation also means that downloading a set of revisions from the "master" repository (for instance, all the changes that have occurred since your last download) is a separate step from actually updating your source files to reflect those changes. This way, if you're in the middle of some local changes, you can download the latest revisions without merging them into your source, disconnect from the network, keep working on your local changes, and then, only when you're happy with your changes, you can merge them with those that you downloaded.

- Since repositories are small and copying local files is fast, there's low overhead for having multiple copies of a repository on your drive. So a branch is implemented as a copy of a repository, along with a pointer to the original (this is very fast because hard links are used for the files that make up the repository itself, with copy-on-write semantics for later changes). You can do independent changes in an independent repository and then merge the results when you're done. Once again, you can do multiple experimental commits on each repository without worries of polluting the master repository with a record of those changes until you're sure they're worth pushing back.

- Changes can be either pulled from another repository or pushed to it. Again, the separation between the set of revisions a repository knows about and the state of the source files on the disk means that the owner of the repository gets to review the changes before making them official.

- Since multiple branches are the norm, merging is a common operation.
As much merge support as possible is included in the tool and the rest is easy to plug in via standard interactive merge tools. The result is that merges tend to be quite easy and when your lines of code don't overlap you typically don't even have to think about them.

- Given that each developer has a copy of the entire repository, there's no reason that two developers have to communicate changes through the centralized repository. If I have changes you want, you can pull them from my repository, preserving all the incremental revisions and comments that I've made, and then push the entire thing back to the master when you're done. The history of branches and merges, including who did what, is kept so that the history reflects what actually happened (including the fact that there were 2 parallel branches for a while).

- By cryptographically hashing each "change set" (not only the state of the files in that revision but also the history of the revision), the revision control system can give you a single unique 40-digit hexadecimal "change set number" that can be verified against an actual repository to guarantee that it is not corrupted in any way. (For day-to-day use, a developer only needs to use the first few digits of the number, enough to be unambiguous within that repository, or can instead use simple, consecutive revision numbers.)

- The final idea is that in this model, there's nothing special about the master repository. Every repository has the same capabilities as every other, and it's up to a project to assign any special significance to one or more repositories.

- Every file required by the version control system can be stored in a single hidden (.hg) directory at the top level of the repository. This is the single point of file namespace pollution for using the system in a project.

That's probably enough to give the basic idea.
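
To make the "change set number" idea a bit more concrete, here is a minimal Python sketch of how hashing the content together with the parent ids ties each number to its whole history. This is only an illustration under my own assumptions, not Mercurial's actual on-disk format; the names changeset_id, resolve_prefix, and NULL_ID are made up for the example.

```python
import hashlib

# Hypothetical stand-in for "no parent" (used by the first change set).
NULL_ID = "0" * 40


def changeset_id(content: bytes, parent1: str = NULL_ID, parent2: str = NULL_ID) -> str:
    """Return a 40-hex-digit id covering the content AND the parent ids.

    Because the parents' ids are hashed in, changing anything earlier in
    the history changes every id that comes after it.
    """
    h = hashlib.sha1()
    # Sort the parents so the id does not depend on their order.
    for parent in sorted((parent1, parent2)):
        h.update(parent.encode("ascii"))
    h.update(content)
    return h.hexdigest()


def resolve_prefix(prefix: str, known_ids: list) -> str:
    """Expand a short prefix to the full id if it is unambiguous."""
    matches = [cid for cid in known_ids if cid.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"prefix {prefix!r} matches {len(matches)} ids")
    return matches[0]


# Two revisions with identical content but different histories.
first = changeset_id(b"hello\n")
second = changeset_id(b"hello, world\n", parent1=first)
tampered = changeset_id(b"hello, world\n", parent1=changeset_id(b"HELLO\n"))
```

Here second and tampered hold the same file content but get different ids, since their histories differ. In the same spirit, resolve_prefix(second[:12], [first, second]) gives back the full id, which is the "first few digits" convenience mentioned above.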

*User Interface*

As far as the interface goes, you type "hg init" at the top level of a project directory containing your source files to create a boilerplate "blank-slate" .hg directory. You type "hg status" to see which new files Mercurial sees in your directory (at this point all files show up as new). You create a .hgignore file at the top level and add glob- or regex-style patterns to tell Mercurial about the files that it should not track (object files, backup files, etc.). You type "hg status" again to confirm that the set of files Mercurial sees is the one you want to add to the project and then "hg add" to add them all (you can also add a subset with "hg add <filenames>"). Finally, you do "hg commit" ("hg ci" is equivalent). At this point you have your first revision and an "hg status" should show nothing (or whatever files you chose not to add if you didn't add them all for the first commit).

Typing "hg clone <hg_dir>" in a completely separate directory will create a copy (branch) of your project there. "hg clone <url>" or "hg clone ssh://[EMAIL PROTECTED]:22/<hg_dir>" will create a branch of a network-visible project. "hg pull [<repository>]" will copy/download any new revisions from the named repository, which defaults to the one you branched from. "hg update" is required to bring the state of your sandbox in sync with the downloaded revisions. An "hg merge" may be needed if the downloaded changes conflict with your own. "hg push" will push changes back up to the place your directory was cloned from (again, an "hg update" and "hg commit" are required on the other end to make those changes official).

Mercurial also has a built-in web server that can be started if you want people on a shared network to be able to browse your repository that way. In addition, it comes with a graphical tool (hgk) which allows you to see the history of a project including branches and merges.
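
If you'd rather see the walkthrough above as a script, here is a rough Python sketch that drives the hg command line through subprocess. The function names are my own invention, and it assumes hg is on your PATH; the -u and -m options just supply the user name and commit message so that "hg commit" doesn't open an editor.

```python
import shutil
import subprocess


def first_commit_commands(message: str, user: str) -> list:
    """Commands for turning a plain source directory into a repository."""
    return [
        ["hg", "init"],                              # create the .hg directory
        ["hg", "status"],                            # everything shows up as new
        ["hg", "add"],                               # add all non-ignored files
        ["hg", "commit", "-u", user, "-m", message],
        ["hg", "status"],                            # should now print nothing
    ]


def sync_commands(source: str = None) -> list:
    """Commands for fetching new revisions and updating the sandbox."""
    pull = ["hg", "pull"] + ([source] if source else [])
    return [pull, ["hg", "update"]]


def run_all(commands: list, cwd: str) -> None:
    """Run each command in the directory cwd, stopping at the first failure."""
    if shutil.which("hg") is None:
        raise RuntimeError("hg was not found on PATH")
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

For example, run_all(first_commit_commands("initial import", "damon"), "/path/to/project") would perform the first commit in that directory (the path and user name are placeholders). Write your .hgignore before calling it if there are files you don't want tracked.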

I'm fond of using tkdiff to diff my sandbox with a repository, so I hacked a copy of tkdiff to do this (perhaps by now the official one supports it as well).

One more cool feature I have to mention: Mercurial has a "bisect" command that you can use for finding when a bug was introduced. You start bisect, which chooses a revision of the code for you. You run your regression test and run bisect again, telling it whether your test succeeded or failed. This chooses a new revision of the code using a binary search. In a small number of iterations, you find the change that broke the code. With an automated test, the whole search is easy to script. I haven't used the command yet, but am looking forward to it. I think we should create something similar for use with xcs, since this automates a useful process that many people find prohibitively tedious.

*Mercurial Weaknesses*

One weakness of Mercurial is that it does not have support for storing multiple projects in the same repository (i.e. you might want to selectively check out a single project without checking out the rest). If your project source includes large, independent subsystems and projects (a situation that I haven't dealt with yet in my own use of hg), it sounds like the way to handle this is to use relative symbolic links in the separate projects. But I've only read some references to doing this and I don't completely understand it yet.

*Other Distributed Version Control Tools*

There's another system, called Git, that is probably just as good as Mercurial. Git was created by Linus Torvalds and is based on almost the same set of ideas as Mercurial. Git is quite a bit faster for many operations and uses a little less disk space, but last I checked it still had bad Windows support and Mercurial had better documentation.
Otherwise, they seem pretty similar in terms of robustness and features, although I found Git to be a bit more confusing at first glance because of the large number of additional "plumbing" commands that it makes available. Recently I took a second look and got the impression that there's a subset of the Git commands that is almost identical to the hg commands. Although I originally tried Git first, I later tried Mercurial and (perhaps because of the better documentation) never went back.

I was also impressed by the fact that the original Mercurial source code was just a couple thousand lines of pure Python and was, at the time, within a factor of 2 of the speed of Git. I appreciate the engineering required to write something in an elegant, concise way and still have it perform within an order of magnitude of C code written by a master of ultra-efficient OS-level C. The Mercurial source is larger now, and includes 3 small (<500 lines each) C files to speed up diffs, patch files, and some other low-level feature, but is still pretty small (~20KLOC vs. ~90KLOC for Git). In any case, both tools were so many times faster than any version control I'd used in the past that I really didn't care about that last percent of speed that Git might give me.

The tools have shared a lot of ideas and even some code (e.g. hgk is derived from a Git GUI called gitk, and Git has copied some of hg's features) and I expect this to continue since they're both developed by active members of the Linux kernel community.

The rest of the distributed version control systems I know of are: bzr ("Bazaar", written in Python), monotone, GNU Arch, and darcs (written in Haskell). Darcs is supposed to be conceptually different from the others but I really don't know much about it. All of these others had significantly lower performance than Git or hg last I checked, but things are changing fast.
Bzr is supposed to have a slightly easier command line interface, but I find Mercurial's to be pretty easy already. There are tools for migrating projects from each of these to any of the others.

SVK is a tool for use with SVN that lets you have a local repository. I think it's kind of like an svn repository on one side (the user side) and an svn client on the other (the side that talks to the real SVN server). A good friend uses it and recommends it for SVN users who want to be able to do "offline" publishes. My understanding, however, is that it doesn't provide any of the other features of the tools above.

*Migrating to Mercurial*

There are a few tools for migrating projects from SVN and CVS. The one called "hgsvn" seemed like the best for SVN last time I checked. The original import is kind of slow, though, since it has to do something like check out each revision from the SVN server. Some people seem to feel productive using hg locally while publishing to a cvs or svn server. I'm not sure how that works though.

*Links*

Official Page - http://www.selenic.com/mercurial/wiki/
Tutorial - http://www.selenic.com/mercurial/wiki/index.cgi/Tutorial
Book - http://hgbook.red-bean.com/hgbook.html
Google Tech Talk about Hg: http://www.youtube.com/watch?v=JExtkqzEoHY
Linus' egotistical talk about Git: http://www.youtube.com/watch?v=4XpnKHJAok8
Randal Schwartz' talk about Git: http://www.youtube.com/watch?v=8dhZ9BXQgc4

Performance Benchmarks (of varying quality and age) -
http://weblogs.mozillazine.org/jst/archives/2006/11/vcs_performance.html
http://weblogs.mozillazine.org/jst/archives/2007/02/bzr_and_different_network_prot.html
http://git.or.cz/gitwiki/GitBenchmarks
https://lists.ubuntu.com/archives/bazaar/2006q2/011953.html

-Damon
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000