On Wed, Apr 20, 2005 at 10:30:15AM -0400, C. Scott Ananian wrote:

Hi,
your code looks pretty cool. Thank you!

> On Wed, 20 Apr 2005, Martin Uecker wrote:
>
> > The other thing I don't like is the use of a sha1
> > for a complete file. Switching to some kind of hash
> > tree would allow to introduce chunks later. This has
> > two advantages:
>
> You can (and my code demonstrates/will demonstrate) still use a
> whole-file hash to use chunking. With content prefixes, this takes
> O(N ln M) time (where N is the file size and M is the number of
> chunks) to compute all hashes; if subtrees can share the same prefix,
> then you can do this in O(N) time (ie, as fast as possible, modulo a
> constant factor, which is '2'). You don't *need* internal hashing
> functions.

I don't understand this paragraph. What is an internal hash function?

Your code seems to do exactly what I want. The hashes are computed
recursively, as in a hash tree, in O(N ln N) time. The only difference
between your design and one based on a conventional (binary) hash tree
seems to be that data is stored in the intermediate nodes too.

> > It would allow git to scale to repositories of large
> > binary files. And it would allow to build a very cool
> > content transport algorithm for those repositories.
> > This algorithm could combine all the advantages of
> > bittorrent and rsync (without the cpu load).
>
> Yes, the big benefit of internal hashing is that it lets you check
> validity of a chunk w/o having the entire file available. I'm not
> sure that's terribly useful in this case. [And, if it is, then it
> can obviously be done w/ other means.]

If I am not missing anything essential, you can validate each treap
piece at the moment you get it from the network with its SHA1 hash,
and then proceed with downloading the prefix and suffix trees (in
parallel, if you have more than one peer, a la bittorrent).

> > And it would allow trivial merging of patches which
> > apply to different chunks of a file in exactly the same
> > way as merging changesets which apply to different
> > files in a tree.
>
> I'm not sure anyone should be looking at chunks. To me, at least,
> they are an object-store-implementation detail only. For merging,
> etc, we should be looking at whole files, or (better) the whole
> repository. The chunking algorithm is guaranteed not to respect
> semantic boundaries (for *some* semantics of *some* file).

You might be right. I just wanted to point out this possibility
because it would allow us to avoid calling external merging code for a
lot of trivial merges.

bye,
Martin

--
One night, when little Giana from Milano was fast asleep, she had a
strange dream.
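For concreteness, here is a minimal sketch (in Python) of the kind of
hash tree discussed above: SHA1 over fixed-size chunks, with an
authentication path so that a single chunk can be validated against the
root hash without the rest of the file. The chunk size, the leaf/node
tag bytes, and all names are illustrative assumptions, not git's or
Scott's actual scheme.

```python
import hashlib

CHUNK = 4  # unrealistically small chunk size, just for the demo


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def leaf_hash(chunk: bytes) -> bytes:
    return sha1(b"L" + chunk)           # tag leaves vs. internal nodes


def node_hash(left: bytes, right: bytes) -> bytes:
    return sha1(b"N" + left + right)


def tree_root(chunks):
    """Compute the root hash of a binary hash tree over the chunks."""
    level = [leaf_hash(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:              # duplicate last hash on odd levels
            level.append(level[-1])
        level = [node_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def auth_path(chunks, index):
    """Sibling hashes needed to check chunk `index` against the root."""
    level = [leaf_hash(c) for c in chunks]
    path = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sib = index ^ 1
        path.append((level[sib], sib < index))  # (hash, sibling-is-left?)
        level = [node_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return path


def verify_chunk(chunk, path, root):
    """Validate one chunk as it arrives, without the whole file."""
    h = leaf_hash(chunk)
    for sibling, sibling_is_left in path:
        h = node_hash(sibling, h) if sibling_is_left else node_hash(h, sibling)
    return h == root


data = b"some file content, split into chunks"
chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
root = tree_root(chunks)

# A received chunk plus its authentication path is enough to check it;
# a corrupted chunk fails the same check.
assert verify_chunk(chunks[3], auth_path(chunks, 3), root)
assert not verify_chunk(b"XXXX", auth_path(chunks, 3), root)
```

This is the property quoted above: a peer can validate each piece at
the moment it arrives from the network and keep downloading the rest
in parallel, bittorrent-style.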