I spent some time over the past few weeks researching various open-source version control apps for use in VFX, and thought I'd throw you all an update with my findings. As I explored different options and thought about the big picture, I came up with some features that I considered necessary and/or preferable.
---prerequisites---
- free or very cheap (Perforce is $900/user x 100 users = $90,000 = non-option)
- cross-platform
- Python API
- fast performance with binary files
- configurable to conserve disk space:
  - ability to easily remove unneeded files from the repo (aka "obliterate")
  - limited file redundancy

---bonus---
- no recursive special directories (like .svn directories)

Most of the prerequisites are based around the notion that we'll be dealing with some very large files. We want to avoid replicating them all over our server, because redundancy is a waste of disk space, network traffic, and copy time.

So, what were my conclusions? Subversion simply won't work. Here's why: while Subversion's Python API seems quite top-notch, Subversion itself fails pretty miserably when it comes to binary performance and disk space usage. It stores all files in the repo using a delta algorithm, meaning each file is stored not as a whole file, but as the difference between itself and the previous commit. This has the advantage of saving disk space and of always having the diff on hand. However, calculating a delta for many large binary files -- and then later merging deltas to reform complete files -- takes prohibitively (read: insanely) long. Take a look at this article for some performance tips and figures: http://www.ibm.com/developerworks/java/library/j-svnbins.html. Unfortunately, their solution is to use svn's import and export commands, which store and retrieve binary files whole and uncompressed. The problem is that you don't get any version control on those files, so what's the bloody point?

The second major failing is disk space usage. The delta algorithm saves space, but that savings is far outweighed by several other problems. First of all, every file you check out is stored twice. Yep, EVERY file. In addition to your working copy, Subversion keeps an extra copy in the .svn directory, so that IF you edit the file you can do a quick, offline diff. There's no way to turn off this "feature".
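To make the delta-chain problem concrete, here's a toy sketch of delta storage. This is my own illustration, not Subversion's actual algorithm (which uses xdelta-style binary diffs and skip-deltas) -- but it shows the essential cost: rebuilding a heavily revised file means replaying every delta since the base, and for large binaries whose content changes everywhere, each "delta" is nearly the size of the file anyway.

```python
def make_delta(old: bytes, new: bytes):
    """Record (offset, replacement-bytes) runs where new differs from old."""
    delta, i = [], 0
    while i < len(new):
        if i >= len(old) or old[i] != new[i]:
            j = i
            while j < len(new) and (j >= len(old) or old[j] != new[j]):
                j += 1
            delta.append((i, new[i:j]))
            i = j
        else:
            i += 1
    return delta, len(new)

def apply_delta(old: bytes, delta, new_len: int) -> bytes:
    """Rebuild the newer revision by patching the older one."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for offset, chunk in delta:
        buf[offset:offset + len(chunk)] = chunk
    return bytes(buf)

# Store 50 revisions of a "binary" file as one base plus 49 deltas.
# Each revision changes every byte, as renders and caches tend to do.
revisions = [bytes([r]) * 1024 for r in range(50)]
store = [revisions[0]]
for old, new in zip(revisions, revisions[1:]):
    store.append(make_delta(old, new))

# Reading back the latest revision replays the whole chain -- O(history) work.
current = store[0]
for delta, new_len in store[1:]:
    current = apply_delta(current, delta, new_len)
assert current == revisions[-1]
```

With text files the deltas stay small and the trade-off pays off; with opaque binaries you pay the replay cost and save almost no space.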
So, if you're checking out 500GB of data, it's going to be more like 1TB. All that extra disk space used up in every working copy provides almost no benefit, because diffs between binary files are useless without a custom app to interpret the data. Last in the disk space category: if a user accidentally checks in 100GB of cache data, or, let's say, your repo is getting very large and you want to wipe out some old versions of an asset that you know aren't being used, you cannot do so without going through some extreme pain. You have to use `svnadmin dump` to dump your entire repo to a text file, then use `svndumpfilter` to filter through your data and remove what you don't want, then rebuild your repo. This process can take many hours if your repo is very large.

The last part is a pet peeve, and that's the recursive .svn directories. These are annoying to deal with, because if you decide to swap out some directories in your working copy with some others of the same name and you expect it to simply use the new ones in their place, it won't work. You have to copy all the .svn folders from the originals into the new set. Imagine how well this will work with artists! You would have to write scripts for moving and modifying these .svn directories, and the artists would have to reliably use them instead of just dragging and dropping directories, or the system would break down.

I was pretty disappointed to finally come to this conclusion about Subversion, but the fact is that it does what it's meant to do well, and managing large binary datasets is not what it's meant to do. So, I moved on and began applying my criteria to pretty much every revision control system I could find (using this list: http://en.wikipedia.org/wiki/Comparison_of_revision_control_software ). Most are cvs/svn derivatives with no real advantage in feature set.
I ran away from anything that used delta compression on binary files, and at first I shied away from distributed systems because of what I read in the Mercurial manual:

"Because Subversion doesn't store revision history on the client, it is well suited to managing projects that deal with lots of large, opaque binary files. If you check in fifty revisions to an incompressible 10MB file, Subversion's client-side space usage stays constant. The space used by any distributed SCM will grow rapidly in proportion to the number of revisions, because the differences between each revision are large."

Essentially, if you have a 500GB repo, then that 500GB is copied to every working copy -- i.e., Mercurial is worse than Subversion with binary files (and Subversion is already pretty bad with binary files). I shouldn't write off Mercurial, though, because with the right features it still might be viable -- as I shortly discovered, my favorite option ended up being a distributed system.

That system is git. So far, I think it has the most potential of anything I've seen. It's distributed, but very flexible, with many different models for revision control, plus a lot of options to help save disk space and network traffic. It can even be configured to work like cvs/svn, if that is your desire. The project was started by Linus Torvalds, and as he put it:

"It's not an SCM, it's a distribution and archival mechanism. I bet you could make a reasonable SCM on top of it, though. Another way of looking at it is to say that it's really a content-addressable filesystem, used to track directory trees."

(taken from this helpful site: http://utsl.gen.nz/talks/git-svn/intro.html )

The Python API is provided by a 3rd party, which is a bit disappointing (ironic, coming from the guy who started PyMEL), but it exists and looks object-oriented enough.
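The "content-addressable filesystem" idea is easy to demonstrate: git names every stored object by the SHA-1 of a short header plus the file's content, so identical content is stored exactly once no matter how many paths or commits reference it. The blob format below is git's documented one; the surrounding code is just my illustration.

```python
import hashlib
import zlib

def git_blob_id(content: bytes) -> str:
    """The object id git assigns to file content: sha1 of 'blob <size>\\0' + data."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

data = b"hello\n"
oid = git_blob_id(data)
print(oid)  # ce013625030ba8dba906f756967f9e9ca394464a -- matches `git hash-object`

# On disk, git keeps the object zlib-compressed at .git/objects/ce/0136...,
# so two identical textures checked in under different paths cost one copy.
stored = zlib.compress(b"blob %d\x00" % len(data) + data)
```

For our purposes the key consequence is deduplication: renaming or duplicating a huge file doesn't duplicate its storage, because the name of the data IS its content.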
Git doesn't force delta compression on the files it stores, the amount of history copied from a repo can be limited or even shared via hard links, it has the ability to prune old commits, it has an option to pack away commits that are no longer used with even greater compression, and it doesn't use annoying recursive directories. I haven't begun using git in a real-world test yet, but if you're looking for something to base a pipe on, this could be the horse to bet on. Ultimately, I would really like to start an open-source asset management project, so take a look at git and see what you think. I'll let you know as I find out more. I haven't done a speed test on a large image sequence yet -- that could still be a deal-breaker -- but so far it "feels" fast.

-chad

Yours, Maya-Python Club Team.
