On Wed, 20 Apr 2005, Martin Uecker wrote:
> The other thing I don't like is the use of a sha1 for a complete file. Switching to some kind of hash tree would allow chunks to be introduced later. This has two advantages:
You can (and my code demonstrates/will demonstrate) still use a whole-file hash while chunking. With content prefixes, computing all the hashes takes O(N ln M) time (where N is the file size and M is the number of chunks); if subtrees can share the same prefix, then you can do it in O(N) time (ie, as fast as possible, modulo a constant factor, which is '2'). You don't *need* internal hashing functions.
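To make the prefix idea concrete, here's a rough Python sketch. The "chunk"/"file" header strings are invented for illustration, in the spirit of git's "blob <len>\0" header; they aren't anything git actually uses:

```python
import hashlib

def chunk_hash(chunk: bytes) -> bytes:
    # Hypothetical leaf prefix: a type tag plus length, then the payload,
    # so a chunk id can never collide with a differently-typed object.
    return hashlib.sha1(b"chunk %d\0" % len(chunk) + chunk).digest()

def file_hash(chunks: list) -> str:
    # Whole-file id derived from the chunk ids: hash each of the M chunks
    # (O(N) total over the file), then one pass over the M digests.
    body = b"".join(chunk_hash(c) for c in chunks)
    return hashlib.sha1(b"file %d\0" % len(body) + body).hexdigest()

data = b"hello, world" * 100
chunks = [data[i:i + 64] for i in range(0, len(data), 64)]
print(file_hash(chunks))
```

The point is only that the whole-file id is a deterministic function of the chunk ids, so chunking can be layered under an unchanged top-level hash.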
> It would allow git to scale to repositories of large binary files. And it would allow building a very cool content transport algorithm for those repositories. This algorithm could combine all the advantages of bittorrent and rsync (without the cpu load).
Yes, the big benefit of internal hashing is that it lets you check validity of a chunk w/o having the entire file available. I'm not sure that's terribly useful in this case. [And, if it is, then it can obviously be done w/ other means.]
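For what it's worth, validating a chunk without the whole file just means keeping the per-chunk ids around and re-hashing what arrives. A toy sketch (the "chunk" header is invented for illustration, not a real git convention):

```python
import hashlib

def chunk_id(chunk: bytes) -> str:
    # Hypothetical leaf convention: length-prefixed chunk under sha1.
    return hashlib.sha1(b"chunk %d\0" % len(chunk) + chunk).hexdigest()

def verify_chunk(received: bytes, expected_id: str) -> bool:
    # With per-chunk ids stored in the tree, a peer can validate one
    # chunk in isolation -- the rest of the file need not be present.
    return chunk_id(received) == expected_id

good = b"some chunk payload"
cid = chunk_id(good)
print(verify_chunk(good, cid))        # True
print(verify_chunk(b"corrupt", cid))  # False
```

This is the "other means" I had in mind: the capability doesn't require internal hashing baked into the file's own id.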
> And it would allow trivial merging of patches which apply to different chunks of a file, in exactly the same way as merging changesets which apply to different files in a tree.
I'm not sure anyone should be looking at chunks. To me, at least, they are an object-store-implementation detail only. For merging, etc, we should be looking at whole files, or (better) the whole repository.
The chunking algorithm is guaranteed not to respect semantic boundaries (for *some* semantics of *some* file).
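To illustrate the point: a toy content-defined chunker cuts wherever a rolling value matches a bit mask, so boundaries land mid-line, mid-function, wherever the byte statistics happen to say. This is not git's or any production rolling hash, just a sketch of why chunk boundaries and semantic boundaries are unrelated:

```python
def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0xFF) -> list:
    # Cut after position i whenever the rolling value's low bits are all
    # zero -- a purely statistical criterion with no notion of lines,
    # records, or syntax.
    cuts = []
    h = 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF
        if i >= window and (h & mask) == 0:
            cuts.append(i + 1)
            h = 0
    return cuts
```

Expected chunk size is governed by the mask (roughly one cut per 2^8 bytes here), and a one-byte edit only moves the boundaries near it, which is the property the transport cares about; semantics never enter into it.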
( http://cscott.net/ )