I noticed people on this mailing list have started talking about using blob deltas for compression, and about the basic issue that the resulting files are too small for efficient filesystem storage. I thought about this a little and decided I should send out my ideas for discussion.
In my proposal, the current git object storage model (one compressed object per file) remains the primary storage mechanism; however, there would also be a backup mechanism based on multiple deltas grouped in one file.

For example, suppose you're looking for an object with a hash of eab75ce51622aa312bb0b03572d43769f420c347. First you'd look at .git/objects/ea/b75ce51622aa312bb0b03572d43769f420c347 - if the file exists, that's your object. If the file does not exist, you'd then look for .git/deltas/ea/b, .git/deltas/ea/b7, .git/deltas/ea/b75, .git/deltas/ea/b75c, ... up to some maximum search path length. You stop at the first file you can find.

Supposing that file is .git/deltas/ea/b7, it would contain a diff (let's assume unified format for now, though ideally it'd be better to have something that allows binary file deltas too) of many archived objects with hashes starting with eab7, each compared to a different object (presumably some direct or indirect ancestor):

  diff -u 8f5ba0203e31204c5c052d995a5b4449226bcfb5 eab75ce51622aa312bb0b03572d43769f420c347
  --- 8f5ba0203e31204c5c052d995a5b4449226bcfb5
  +++ eab75ce51622aa312bb0b03572d43769f420c347
  @@ -522,7 +522,7 @@
  ....
  diff -u 77dc2cb94930017f62b55b9706cbadda8c90f650 eab71c51dbc62797d6c903203de44cc6a734c05c
  --- 77dc2cb94930017f62b55b9706cbadda8c90f650
  +++ eab71c51dbc62797d6c903203de44cc6a734c05c
  @@ -560,13 +563,17 @@
  ...

Based on this delta file, we'd then look up the object 8f5ba0203e31204c5c052d995a5b4449226bcfb5 (a lookup that could itself require recursively rebuilding that object from deltas) and try to build eab75ce51622aa312bb0b03572d43769f420c347 by applying the delta and then double-checking the hash.

To me the strengths of this proposal would be:

* It does not muddy the git object model - it just acts independently of it, as a way to rebuild git objects from deltas.

* Old objects can be compressed by creating a delta against a close ancestor, then erasing the original file storage for that object.
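For concreteness, the lookup order could be sketched roughly as below. This is just an illustration, not an implementation: the function names and the maximum prefix length of 8 are made up for this example, and the actual delta parsing, recursive rebuilding, and hash verification are left out.

```python
import os

def delta_candidates(sha1, max_prefix=8):
    """Candidate delta-file paths for an object hash, shortest prefix first.

    max_prefix is an arbitrary cap standing in for the proposal's
    "maximum search path length".
    """
    top, rest = sha1[:2], sha1[2:]
    return [f".git/deltas/{top}/{rest[:n]}" for n in range(1, max_prefix + 1)]

def lookup(sha1):
    """Try the plain object file first, then the delta files in order."""
    obj = f".git/objects/{sha1[:2]}/{sha1[2:]}"
    if os.path.exists(obj):
        return ("object", obj)
    for path in delta_candidates(sha1):
        if os.path.exists(path):
            # The caller would parse this file, find the delta whose target
            # is sha1, rebuild the base object (possibly recursively), apply
            # the delta, and verify the result hashes back to sha1.
            return ("delta", path)
    return None
```

For eab75ce51622aa312bb0b03572d43769f420c347, this tries the plain object path first, then .git/deltas/ea/b, .git/deltas/ea/b7, and so on.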
* The object delta can be appended to an existing delta file (which avoids the small-file storage issue), or, if the delta file gets too big, it can be split into 16 smaller files based on the hashes of the objects it stores deltas for.

* The system is flexible enough to explore different delta strategies. For example, one could decide to keep one object in every 10 in the database and store the other 9 as deltas against the immediate object ancestor, or pick any other tradeoff - and the system would still work the same way (with different performance tradeoffs, though).

Does this sound insane? Too complicated, maybe? Is there any kind of semi-standard, binary-capable, multiple-file diff format that could be used for this application instead of unified diffs?

-- 
Michel "Walken" Lespinasse
"Bill Gates is a monocle and a Persian cat away from being the villain
in a James Bond movie." -- Dennis Miller