On Sat, Apr 16, 2005 at 11:11:00AM -0400, C. Scott Ananian wrote:
> On Sat, 16 Apr 2005, Martin Uecker wrote:
> >The right thing (TM) is to switch from SHA1 of compressed
> >content for the complete monolithic file to a merkle hash tree
> >of the uncompressed content. This would make the hash
> >independent of the actual storage method (chunked or not).
> It would certainly be nice to change to a hash of the uncompressed 
> content, rather than a hash of the compressed content, but it's not 
> strictly necessary, since files are fetched all at once: there's no 'read 
> subrange' operation on blobs.
> I assume 'merkle hash tree' is talking about:
>   http://www.open-content.net/specs/draft-jchapweske-thex-02.html
> ..which is very interesting, but not quite what I was thinking.
> The merkle hash approach seems to require fixed chunk boundaries.

I don't know what is written there, but I don't
consider fixed chunk boundaries part of the definition.

> The rsync approach does not use fixed chunk boundaries; this is necessary 
> to ensure good storage reuse for the expected case (i.e., inserting a single 
> line at the start or in the middle of the file, which changes all the 
> chunk boundaries).

Yes. The chunk boundaries should be derived deterministically
from local properties of the data. Use a rolling checksum over
some small window and split the file if it hits a special value (0).
This is what the rsyncable patch to zlib does.
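
To make the splitting rule concrete, here is a minimal sketch in Python.
It is not the code from the rsyncable patch: the function name
split_chunks, the plain byte sum used as the checksum, and the window
and mask values are all assumptions picked for illustration.

```python
def split_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list[bytes]:
    """Split data at content-defined boundaries.

    A boundary is placed after byte i when the sum of the last `window`
    bytes, masked with `mask`, equals the special value 0. The window
    size and mask are illustrative assumptions, not tuned values.
    """
    chunks = []
    start = 0      # index where the current chunk begins
    rolling = 0    # sum of the bytes currently inside the window
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        # Only test once the window is full, so the decision depends
        # purely on the last `window` bytes and not on the file offset.
        if i + 1 >= window and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks
```

Because each boundary depends only on the preceding window of bytes,
inserting a line at the start of a file changes the first chunk and
leaves all later chunks byte-for-byte identical, so they can be reused.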

> Further, in the absence of subrange reads on blobs, it's not entirely 
> clear what using a merkle hash would buy you.

The whole design of git is a hash tree. If you extend
this tree structure into files you end up with merkle
hash trees. Everything else is just more complicated.
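
A minimal sketch of what extending the tree into files could look
like, in Python. This is not git's object format: the function name
merkle_root, the 0x00/0x01 prefix bytes that keep leaves and inner
nodes distinct, and the rule of promoting an unpaired node unchanged
are assumptions made for illustration.

```python
import hashlib


def merkle_root(chunks: list[bytes]) -> bytes:
    """Return the SHA1 root of a binary hash tree over uncompressed chunks."""
    if not chunks:
        return hashlib.sha1(b"").digest()
    # Leaves: one digest per chunk, tagged so a leaf can never be
    # confused with an inner node (assumed convention).
    level = [hashlib.sha1(b"\x00" + chunk).digest() for chunk in chunks]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                pair = level[i] + level[i + 1]
                parents.append(hashlib.sha1(b"\x01" + pair).digest())
            else:
                parents.append(level[i])  # unpaired node is promoted unchanged
        level = parents
    return level[0]
```

The root depends only on the uncompressed chunk contents and their
order, not on how the chunks are compressed or stored. Two versions of
a file that share chunks also share the corresponding leaf hashes.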


One night, when little Giana from Milano was fast asleep,
she had a strange dream.
