Re: weaning distributions off tarballs: extended verification of git tags
On Sat, Feb 28, 2015, at 10:48 AM, Colin Walters wrote:
> Hi,
>
> TL;DR: Let's define a standard for embedding stronger checksums in tags and commit messages:
> https://github.com/cgwalters/homegit/blob/master/bin/git-evtag

[time passes]

I finally had a bit of time to pick this back up again in:

https://github.com/cgwalters/git-evtag

It should address the core concern here about stability of `git archive`. I prototyped it out with libgit2 because it was easier, and I'd actually like to be able to use this with older versions of git.

But I think the next steps here are:

- Validate the core design
  * Tree walking order
  * Submodule recursion
  * Use of SHA512
- Standardize it (Would like to see at least a stupid slow shell script implementation to cross-validate)
- Add it as an option to `git tag`?

Longer term:

- Support adding `Git-EVTag` as a git note, so I can retroactively add stronger checksums to older git repositories
- Anything else?

--
To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: weaning distributions off tarballs: extended verification of git tags
On 03/03/2015 12:44 AM, Junio C Hamano wrote:
> [...]
> I was about to suggest another alternative. Pretend as if Git internally used SHA-512 (or whatever hash you want to use) instead of SHA-1, compute the object names that way. Recompute the contents of a tree object by replacing the 20-byte SHA-1 field in it with a field of whatever length is necessary to hold the longer object names of elements in the tree.
>
> But then a realization hit me: what new value will be placed in the parent field in the commit object? You cannot have a SHA-512 variant of the commit object name without recomputing the whole history.
>
> Now, if the final objective is to replace signature of tarballs, does it matter to cover the commit object, or is it sufficient to cover the tree contents?

The original goal was to replace a tarball signature, for which the alternative that you described above seems quite elegant. If the goal were really to certify the entire history, then none of the proposals that I have seen so far is adequate anyway, because none of them propose to include anything better than the original SHA-1s of the parent commits.

Including other metadata from the release commit does not seem useful to me; how valuable is it to know the author and commit message of the last commit that happened to make it into a release? It would be more useful to know the SHA-1 of that commit, but that would presumably be included elsewhere in the packaging data used by the distribution.

[...]

Michael

--
Michael Haggerty
mhag...@alum.mit.edu
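[Editorial sketch] The "pretend Git used SHA-512" alternative discussed above can be illustrated for a single tree. This is a rough Python sketch, not anything proposed on the list: `parse_tree`, `rewrite_tree`, `object_name` and the `new_names` mapping are hypothetical helper names. It relies only on the raw tree layout (mode, space, name, NUL, 20 raw SHA-1 bytes per entry) and on Git's "type length NUL" object header.

```python
import hashlib

def parse_tree(raw):
    """Yield (mode, name, sha1_bytes) for each entry of a raw tree object."""
    i = 0
    while i < len(raw):
        sp = raw.index(b" ", i)      # the mode never contains a space
        nul = raw.index(b"\0", sp)   # the name ends at the first NUL
        yield raw[i:sp], raw[sp + 1:nul], raw[nul + 1:nul + 21]
        i = nul + 21                 # skip the 20-byte SHA-1 field

def rewrite_tree(raw, new_names):
    """Re-encode a tree with each 20-byte SHA-1 replaced by a longer digest.

    new_names (hypothetical) maps old SHA-1 bytes to new digest bytes and
    would be built bottom-up by the caller, subtrees first.
    """
    out = bytearray()
    for mode, name, sha1 in parse_tree(raw):
        out += mode + b" " + name + b"\0" + new_names[sha1]
    return bytes(out)

def object_name(obj_type, content, algo):
    """Hash 'type length NUL' + content, as Git does, with any algorithm."""
    h = hashlib.new(algo)
    h.update(b"%s %d\0" % (obj_type, len(content)))
    h.update(content)
    return h.digest()

# A one-entry tree: the blob "hello\n" stored as README.
blob = b"hello\n"
old = object_name(b"blob", blob, "sha1")
new = object_name(b"blob", blob, "sha512")
tree = b"100644 README\0" + old
tree512 = rewrite_tree(tree, {old: new})
print(object_name(b"tree", tree512, "sha512").hex())
```

Nothing here touches the commit's parent field, which is exactly the limitation raised in the message: the sketch covers tree contents only.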
Re: weaning distributions off tarballs: extended verification of git tags
On Tue, Mar 3, 2015 at 1:12 AM, Joey Hess i...@joeyh.name wrote:
> I support this proposal, as someone who no longer releases tarballs of my software, when I can possibly avoid it. I have worried about signed tags / commits only being a SHA1 break away from useless.
>
> As to the implementation, checksumming the collection of raw objects is certainly superior to tar. Colin had suggested sorting the objects by checksum, but I don't think that is necessary. Just stream the commit object, then its tree object, followed by the content of each object listed in the tree, recursing into subtrees as necessary. That will be a stable stream for a given commit, or tree.

It could be simplified a bit by using ls-tree -r (so you basically have a single big tree). Then hash commit, ls-tree -r output and all blobs pointed by ls-tree in listed order.

--
Duy
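[Editorial sketch] The ls-tree -r simplification can be pictured as hashing three things in order: the commit, a flat listing, then the blobs. The Python below is only an illustration under assumptions: it works on in-memory data rather than calling git, the helper names `ls_tree_line` and `flat_digest` are made up, SHA-256 is an arbitrary choice, and every entry is treated as a blob. The listing line mimics the "mode type object TAB path" layout that `git ls-tree -r` prints.

```python
import hashlib

def ls_tree_line(mode, obj_type, sha1_hex, path):
    # One listing line in the layout of `git ls-tree -r` output.
    return f"{mode} {obj_type} {sha1_hex}\t{path}\n".encode()

def flat_digest(commit_raw, entries, algo="sha256"):
    """Hash commit bytes, then the flat listing, then every blob.

    entries: list of (mode, sha1_hex, path, blob_bytes), already in the
    order the listing would print them.
    """
    h = hashlib.new(algo)
    h.update(commit_raw)
    for mode, sha1_hex, path, _ in entries:
        h.update(ls_tree_line(mode, "blob", sha1_hex, path))
    for _, _, _, blob in entries:
        h.update(blob)
    return h.hexdigest()

# Hypothetical commit bytes and a single empty file.
commit_raw = b"example commit contents\n"
entries = [("100644", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", "empty.txt", b"")]
print(flat_digest(commit_raw, entries))
```

Note that this stream has no per-object length framing; whether that matters is taken up elsewhere in the thread.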
Re: weaning distributions off tarballs: extended verification of git tags
Duy Nguyen pclo...@gmail.com writes:

> On Tue, Mar 3, 2015 at 1:12 AM, Joey Hess i...@joeyh.name wrote:
>> I support this proposal, as someone who no longer releases tarballs of my software, when I can possibly avoid it. I have worried about signed tags / commits only being a SHA1 break away from useless.
>>
>> As to the implementation, checksumming the collection of raw objects is certainly superior to tar. Colin had suggested sorting the objects by checksum, but I don't think that is necessary. Just stream the commit object, then its tree object, followed by the content of each object listed in the tree, recursing into subtrees as necessary. That will be a stable stream for a given commit, or tree.
>
> It could be simplified a bit by using ls-tree -r (so you basically have a single big tree). Then hash commit, ls-tree -r output and all blobs pointed by ls-tree in listed order.

What problem are you trying to solve here, though, by deliberately deviating from what Git internally uses to store these objects? If it is OK to ignore the tree boundary, then you probably do not even need trees in this secondary hash for validation in the first place. For example, you can hash a stream:

    commit object contents + N * (pathname + NUL + blob object contents)

as long as the pathnames are sorted in a predictable order (like in the index order) in the output. That would be even simpler (I am not saying it is necessarily better, and by inference neither is your simplification).

I was about to suggest another alternative. Pretend as if Git internally used SHA-512 (or whatever hash you want to use) instead of SHA-1, compute the object names that way. Recompute the contents of a tree object by replacing the 20-byte SHA-1 field in it with a field of whatever length is necessary to hold the longer object names of elements in the tree.

But then a realization hit me: what new value will be placed in the parent field in the commit object? You cannot have a SHA-512 variant of the commit object name without recomputing the whole history.

Now, if the final objective is to replace signature of tarballs, does it matter to cover the commit object, or is it sufficient to cover the tree contents?

Among the ideas raised so far, I like what Joey suggested, combined with "each should have a 'type length NUL' header" from Sam Vilain, the best. That is, hash the stream:

    commit length NUL + commit object contents +
    tree length NUL + top level tree contents +
    ... list the entries in the order you would find by
    ... some defined traversal order people can agree on.

with whatever the preferred strong hash function of the age.
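[Editorial sketch] The framed stream preferred above can be sketched in a few lines of Python. This is a minimal illustration, not an agreed format: `git_header`, `git_object_name` and `stream_digest` are hypothetical names, SHA-512 is just one choice of strong hash, and the traversal order is left to the caller because the thread had not settled it. The header is Git's usual "type length NUL" prefix.

```python
import hashlib

def git_header(obj_type, content):
    # Git's object header: "<type> <decimal length>" followed by NUL.
    return b"%s %d\0" % (obj_type.encode(), len(content))

def git_object_name(obj_type, content, algo="sha1"):
    """Name of a single object, the way Git computes it (SHA-1 by default)."""
    h = hashlib.new(algo)
    h.update(git_header(obj_type, content))
    h.update(content)
    return h.hexdigest()

def stream_digest(objects, algo="sha512"):
    """Hash a sequence of (type, raw content) pairs, each with its header.

    The caller supplies the objects in whatever traversal order is agreed
    on: commit first, then the top-level tree, then its entries.
    """
    h = hashlib.new(algo)
    for obj_type, content in objects:
        h.update(git_header(obj_type, content))
        h.update(content)
    return h.hexdigest()

# Sanity check against Git's well-known name for the empty blob.
print(git_object_name("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because every object carries its own length, the stream is one digest over many objects while the boundaries between them stay unambiguous.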
Re: weaning distributions off tarballs: extended verification of git tags
On Tue, Mar 3, 2015 at 6:44 AM, Junio C Hamano gits...@pobox.com wrote:
> Duy Nguyen pclo...@gmail.com writes:
>
>> On Tue, Mar 3, 2015 at 1:12 AM, Joey Hess i...@joeyh.name wrote:
>>> I support this proposal, as someone who no longer releases tarballs of my software, when I can possibly avoid it. I have worried about signed tags / commits only being a SHA1 break away from useless.
>>>
>>> As to the implementation, checksumming the collection of raw objects is certainly superior to tar. Colin had suggested sorting the objects by checksum, but I don't think that is necessary. Just stream the commit object, then its tree object, followed by the content of each object listed in the tree, recursing into subtrees as necessary. That will be a stable stream for a given commit, or tree.
>>
>> It could be simplified a bit by using ls-tree -r (so you basically have a single big tree). Then hash commit, ls-tree -r output and all blobs pointed by ls-tree in listed order.
>
> What problem are you trying to solve here, though, by deliberately deviating from what Git internally uses to store these objects? If it is OK to ignore the tree boundary, then you probably do not even need trees in this secondary hash for validation in the first place. For example, you can hash a stream:
>
>     commit object contents + N * (pathname + NUL + blob object contents)
>
> as long as the pathnames are sorted in a predictable order (like in the index order) in the output. That would be even simpler (I am not saying it is necessarily better, and by inference neither is your simplification).

I did nearly that [1]. But this morning I realized trees carry file permission. We should keep that in the final checksum as well.

> Now, if the final objective is to replace signature of tarballs, does it matter to cover the commit object, or is it sufficient to cover the tree contents?
>
> Among the ideas raised so far, I like what Joey suggested, combined with "each should have a 'type length NUL' header" from Sam Vilain, the best. That is, hash the stream:
>
>     commit length NUL + commit object contents +
>     tree length NUL + top level tree contents +
>     ... list the entries in the order you would find by
>     ... some defined traversal order people can agree on.
>
> with whatever the preferred strong hash function of the age.

A bit harder to script, but simpler to provide from cat-file, I think.

[1] http://article.gmane.org/gmane.comp.version-control.git/260211

--
Duy
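[Editorial sketch] The point about file permissions can be shown with the flat "pathname + NUL + blob" stream. In this hypothetical Python sketch, `path_stream_digest` and its `with_mode` switch are made-up names for illustration, and sorting paths as byte strings stands in for "a predictable order". Without the mode in the stream, flipping the executable bit goes unnoticed.

```python
import hashlib

def path_stream_digest(commit_raw, files, with_mode, algo="sha512"):
    """Hash commit bytes plus (path + NUL + blob) for each file.

    files: dict mapping path bytes to (mode bytes, blob bytes).
    Paths are visited in sorted byte order. When with_mode is true,
    the mode and a space are hashed ahead of each path.
    """
    h = hashlib.new(algo)
    h.update(commit_raw)
    for path in sorted(files):
        mode, blob = files[path]
        if with_mode:
            h.update(mode + b" ")
        h.update(path + b"\0" + blob)
    return h.hexdigest()

commit_raw = b"example commit contents\n"  # hypothetical commit bytes
plain = {b"run.sh": (b"100644", b"#!/bin/sh\n")}
executable = {b"run.sh": (b"100755", b"#!/bin/sh\n")}

# Same digest: the mode change is invisible to the path-only stream.
print(path_stream_digest(commit_raw, plain, False) ==
      path_stream_digest(commit_raw, executable, False))  # True
# Different digest once the mode is part of the stream.
print(path_stream_digest(commit_raw, plain, True) ==
      path_stream_digest(commit_raw, executable, True))   # False
```

Hashing the raw tree objects, as in the framed stream quoted above, covers the mode for free because each tree entry already stores it.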
Re: weaning distributions off tarballs: extended verification of git tags
On 03/02/2015 12:08 PM, Junio C Hamano wrote:
>> I have a hazy recollection of what it would take to replace SHA-1 in git with something else; it should be possible (though tricky) to do it lazily, where a tree entry has bits (eg, some of the currently unused file mode bits) to denote which hash algorithm is in use for the entry. However I don't think that got past idea stage...
>
> I think one reason why it didn't was because it would not work well. That "bit that tells this is a new object or old" would mean that a single tree can have many different object names, depending on which of its component entries are using that bit and which aren't. There goes the "we know two trees with the same object name are identical without recursing into them" optimization out the window.
>
> Also it would make it impossible to do what you suggest to Joey to do, i.e. "exactly the same way that git does", once you start saying that a tree object can be encoded in more than one different ways, wouldn't it?

I was reasoning that people would rather not have to rewrite their whole history in order to switch checksum algorithms, and that allowing trees to be lazily converted would make things more efficient. However, I think I see your point here that this doesn't work.

However, as a per-commit header, only the first commit which changes the hashing algorithm would have to re-checksum each of the files: but just in the current tree, not all the way back to the beginning of history. The delta logic should not have to care, and these objects with the same content but different object ID should pack perfectly, so long as git-pack-objects knows to re-checksum objects with the available hash algorithms and spot matches. Other operations like diff which span commit hashing algorithms might be able to get away with their existing object ranking algorithms and cache alternate object IDs for content as they operate to facilitate exact matching across hash algorithm changes.

But actually, for the original problem - just producing a signature with a different hashing algorithm - probably it would be sufficient to just re-hash the current commit and the current tree recursively, and the mixed hash-algorithm case does not need to exist. But I'm just thinking it might not be too hard to make git nicely generic, to be well prepared for when a second pre-image attack on SHA-1 becomes practical.

Sam
Re: weaning distributions off tarballs: extended verification of git tags
On Sat, Feb 28, 2015, at 03:34 PM, Morten Welinder wrote:
> Is there a point to including a different checksum inside a git tag? If someone can break the SHA-1 checksum in the repository then the recorded SHA-256 checksum can be changed. In other words, wouldn't you be just as well off handing someone a SHA-1 commit id?

The issue is more about what the checksum covers, as well as its strength. Git uses a hash tree, which means that an attacker only has to find a collision for *one* of the objects, and the signature is still valid. And that collision is valid for *every* commit that contains that object.

This topic has been covered elsewhere pretty extensively, here's a link:
https://www.whonix.org/forum/index.php/topic,538.msg4278.html#msg4278

Now I think rough consensus is still that git is secure or secure enough - but with this proposal I'm just trying to overcome the remaining conservatism.

(Also, while those discussions were focusing on corrupting an existing repository, the attack model of MITM also exists, and there you don't have to worry about deltas, particularly if the attacker's goal is to get a downstream to do a build and thus execute their hostile code inside the downstream network).

It's really not that expensive to do once per release, basically free for small repositories, and for a large one like the Linux kernel:

    $ cd ~/src/linux
    $ git describe
    v3.19-7478-g796e1c5
    $ time /bin/sh -c 'git archive --format=tar HEAD|sha256sum'
    4a5c5826cea188abd52fa50c663d17ebe1dfe531109fed4ddbd765a856f1966e  -

    real    0m3.772s
    user    0m6.132s
    sys     0m0.279s
    $

With this proposal, the checksum covers an entire stream of objects for a given commit at once; making it significantly harder to find a collision. At least as good as checksummed tarballs, and arguably better since it's pre-compression.

So to implement this, perhaps something like:

    $ git archive --format=raw

as a base primitive, and:

    $ git tag --archive-raw-checksum=SHA256 -s -m ...

?

git fsck could also learn to optionally use this.
Re: weaning distributions off tarballs: extended verification of git tags
I support this proposal, as someone who no longer releases tarballs of my software, when I can possibly avoid it. I have worried about signed tags / commits only being a SHA1 break away from useless.

As to the implementation, checksumming the collection of raw objects is certainly superior to tar. Colin had suggested sorting the objects by checksum, but I don't think that is necessary. Just stream the commit object, then its tree object, followed by the content of each object listed in the tree, recursing into subtrees as necessary. That will be a stable stream for a given commit, or tree.

--
see shy jo
Re: weaning distributions off tarballs: extended verification of git tags
On 03/02/2015 10:12 AM, Joey Hess wrote:
> I support this proposal, as someone who no longer releases tarballs of my software, when I can possibly avoid it. I have worried about signed tags / commits only being a SHA1 break away from useless.
>
> As to the implementation, checksumming the collection of raw objects is certainly superior to tar. Colin had suggested sorting the objects by checksum, but I don't think that is necessary. Just stream the commit object, then its tree object, followed by the content of each object listed in the tree, recursing into subtrees as necessary. That will be a stable stream for a given commit, or tree.

I would really just do it exactly the same way that git does: checksum the objects including their headers with the new hashes.

I have a hazy recollection of what it would take to replace SHA-1 in git with something else; it should be possible (though tricky) to do it lazily, where a tree entry has bits (eg, some of the currently unused file mode bits) to denote which hash algorithm is in use for the entry. However I don't think that got past idea stage...

Sam
Re: weaning distributions off tarballs: extended verification of git tags
Sam Vilain s...@vilain.net writes:

>> As to the implementation, checksumming the collection of raw objects is certainly superior to tar. Colin had suggested sorting the objects by checksum, but I don't think that is necessary. Just stream the commit object, then its tree object, followed by the content of each object listed in the tree, recursing into subtrees as necessary. That will be a stable stream for a given commit, or tree.
>
> I would really just do it exactly the same way that git does: checksum the objects including their headers with the new hashes.

I tend to agree that it is a good idea. I also suspect that would make the implementation simpler by allowing it to share more code, but I didn't look into it too deeply.

> I have a hazy recollection of what it would take to replace SHA-1 in git with something else; it should be possible (though tricky) to do it lazily, where a tree entry has bits (eg, some of the currently unused file mode bits) to denote which hash algorithm is in use for the entry. However I don't think that got past idea stage...

I think one reason why it didn't was because it would not work well. That "bit that tells this is a new object or old" would mean that a single tree can have many different object names, depending on which of its component entries are using that bit and which aren't. There goes the "we know two trees with the same object name are identical without recursing into them" optimization out the window.

Also it would make it impossible to do what you suggest to Joey to do, i.e. "exactly the same way that git does", once you start saying that a tree object can be encoded in more than one different ways, wouldn't it?
weaning distributions off tarballs: extended verification of git tags
Hi,

TL;DR: Let's define a standard for embedding stronger checksums in tags and commit messages:
https://github.com/cgwalters/homegit/blob/master/bin/git-evtag

I think tarballs should go away as a source distribution mechanism in favor of pure git. I won't go into too many details of the why here (hopefully most of you agree!) but that's the background.

Now, there are a few things that the classical tarball model provides:

- Version numbers compatible with dpkg/rpm/etc
  - Do the same with your tag names, and use a well known scheme like v$VERSION
- The assumption that this source has been run through some tests
  - Broken assumption, and regardless you want to rerun tests downstream
- Hosting providers typically offer a strong checksum over the entire source
  - The topic of this post

The above strawman code allows embedding the SHA256(git archive | tar). Now, in order to make this work, the byte output of git archive must never change in the future. I'm not sure how valid an assumption this is. Timestamps are set to the commit timestamp, but I could imagine someone wanting to come along later and tweak the output to be compatible with some variant of tar or something.

We could define the checksum to be over the stream of raw objects, sorted by their checksum, and that way be independent of archiving format variations.

Is there agreement that something like this makes sense in the git core? Does the concept make sense? Does anything like this exist today? Other thoughts/objections?
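[Editorial sketch] The "stream of raw objects, sorted by their checksum" idea can be sketched as follows. This is an assumption-laden illustration in Python rather than the strawman script linked above: `sorted_object_digest` is a hypothetical name, objects are given in memory as (type, content) pairs, each is framed with Git's "type length NUL" header, and the sort key is the object's ordinary SHA-1 name.

```python
import hashlib

def sorted_object_digest(objects, algo="sha256"):
    """Digest over framed raw objects, ordered by their SHA-1 names.

    objects: iterable of (type bytes, content bytes) in any order.
    Sorting by SHA-1 name makes the result independent of input order
    and of any archive format.
    """
    named = []
    for obj_type, content in objects:
        framed = b"%s %d\0" % (obj_type, len(content)) + content
        named.append((hashlib.sha1(framed).digest(), framed))
    h = hashlib.new(algo)
    for _, framed in sorted(named):
        h.update(framed)
    return h.hexdigest()

objs = [(b"blob", b"hello\n"), (b"blob", b""), (b"blob", b"world\n")]
print(sorted_object_digest(objs) == sorted_object_digest(reversed(objs)))  # True
```

Replies in the thread argue that a fixed traversal from the commit gives a stable order without sorting; the sketch only shows that sorting is one way to get determinism.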
Re: weaning distributions off tarballs: extended verification of git tags
Is there a point to including a different checksum inside a git tag? If someone can break the SHA-1 checksum in the repository then the recorded SHA-256 checksum can be changed. In other words, wouldn't you be just as well off handing someone a SHA-1 commit id?

If you can guard the SHA-256 with a signature, you can do the same thing to the SHA-1. Or the tarball for that matter.

Unrelatedly, your assumptions:

Tar balls have too many degrees of freedom to rely on them being created identically in the future.

> - The assumption that this source has been run through some tests

A perfectly valid assumption for some build systems, notably autotools. make distcheck is the only way my tarballs get made and they only get made when the checks succeed. (If your point was that many projects have too few tests, well, then I agree.)

M.

On Sat, Feb 28, 2015 at 9:48 AM, Colin Walters walt...@verbum.org wrote:
> Hi,
>
> TL;DR: Let's define a standard for embedding stronger checksums in tags and commit messages:
> https://github.com/cgwalters/homegit/blob/master/bin/git-evtag
>
> I think tarballs should go away as a source distribution mechanism in favor of pure git. I won't go into too many details of the why here (hopefully most of you agree!) but that's the background.
>
> Now, there are a few things that the classical tarball model provides:
>
> - Version numbers compatible with dpkg/rpm/etc
>   - Do the same with your tag names, and use a well known scheme like v$VERSION
> - The assumption that this source has been run through some tests
>   - Broken assumption, and regardless you want to rerun tests downstream
> - Hosting providers typically offer a strong checksum over the entire source
>   - The topic of this post
>
> The above strawman code allows embedding the SHA256(git archive | tar). Now, in order to make this work, the byte output of git archive must never change in the future. I'm not sure how valid an assumption this is. Timestamps are set to the commit timestamp, but I could imagine someone wanting to come along later and tweak the output to be compatible with some variant of tar or something.
>
> We could define the checksum to be over the stream of raw objects, sorted by their checksum, and that way be independent of archiving format variations.
>
> Is there agreement that something like this makes sense in the git core? Does the concept make sense? Does anything like this exist today? Other thoughts/objections?
Re: weaning distributions off tarballs: extended verification of git tags
On Sat, Feb 28, 2015 at 09:48:05AM -0500, Colin Walters wrote:
> The above strawman code allows embedding the SHA256(git archive | tar). Now, in order to make this work, the byte output of git archive must never change in the future. I'm not sure how valid an assumption this is. Timestamps are set to the commit timestamp, but I could imagine someone wanting to come along later and tweak the output to be compatible with some variant of tar or something.

This is not a safe assumption. Unfortunately, kernel.org assumed that it was the case, and a change broke it. Let's please not make more code that does that.

> We could define the checksum to be over the stream of raw objects, sorted by their checksum, and that way be independent of archiving format variations.

This would be a much better idea, assuming you mean raw git objects. For cryptographic purposes, it's important to make the item boundaries unambiguous, which is usually done using the length. Since the raw git objects include the length, this is sufficient.

If you don't make the boundaries unambiguous, you get the problem you have with v3 OpenPGP keys, where somebody could move bytes from one value to another, creating a different key, but with the same fingerprint (hash value).

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187
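[Editorial sketch] The boundary problem described above is easy to reproduce. In this small Python illustration (the function names `digest_concat` and `digest_framed` are hypothetical, and the "decimal length + NUL" framing is just one possible scheme), two different lists of fields hash identically when simply concatenated, and differently once each field is preceded by its length.

```python
import hashlib

def digest_concat(fields):
    """Hash fields by plain concatenation: boundaries are ambiguous."""
    h = hashlib.sha512()
    for field in fields:
        h.update(field)
    return h.hexdigest()

def digest_framed(fields):
    """Hash each field preceded by its decimal length and a NUL byte."""
    h = hashlib.sha512()
    for field in fields:
        h.update(b"%d\0" % len(field))
        h.update(field)
    return h.hexdigest()

a = [b"ab", b"c"]
b = [b"a", b"bc"]  # one byte moved from the first field to the second

print(digest_concat(a) == digest_concat(b))  # True: same bytes overall
print(digest_framed(a) == digest_framed(b))  # False: lengths differ
```

Raw git objects already begin with "type length NUL", which is why hashing them as-is gives the framed behaviour without inventing a new format.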