Re: weaning distributions off tarballs: extended verification of git tags

2015-07-07 Thread Colin Walters


On Sat, Feb 28, 2015, at 10:48 AM, Colin Walters wrote:
 Hi, 
 
 TL;DR: Let's define a standard for embedding stronger checksums in tags and 
 commit messages:
 https://github.com/cgwalters/homegit/blob/master/bin/git-evtag

[time passes]

I finally had a bit of time to pick this back up again in:

https://github.com/cgwalters/git-evtag

It should address the core concern here about stability of `git archive`.

I prototyped it with libgit2 because that was easier, and because I'd actually 
like to be able to use this with older versions of git.

But I think the next steps here are:

- Validate the core design
  * Tree walking order
  * Submodule recursion
  * Use of SHA512
- Standardize it
  (Would like to see at least a stupid slow shell script implementation to 
cross-validate)
- Add it as an option to `git tag`?

Longer term:
- Support adding `Git-EVTag` as a git note, so I can retroactively add stronger
  checksums to older git repositories
- Anything else?

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: weaning distributions off tarballs: extended verification of git tags

2015-03-05 Thread Michael Haggerty
On 03/03/2015 12:44 AM, Junio C Hamano wrote:
 [...]
 I was about to suggest another alternative.
 
 Pretend as if Git internally used SHA-512 (or whatever hash you
 want to use) instead of SHA-1, compute the object names that
 way.  Recompute the contents of a tree object by replacing
 the 20-byte SHA-1 field in it with a field of whatever
 length is necessary to hold the longer object names of elements in
 the tree.
 
 But then a realization hit me: what new value will be placed in the
 "parent" field in the commit object?  You cannot have a SHA-512
 variant of the commit object name without recomputing the whole history.
 
 Now, if the final objective is to replace signature of tarballs,
 does it matter to cover the commit object, or is it sufficient to
 cover the tree contents?

The original goal was to replace a tarball signature, for which the
alternative that you described above seems quite elegant.

If the goal were really to certify the entire history, then none of the
proposals that I have seen so far is adequate anyway, because none of
them propose to include better than the original SHA-1s of the parent
commits.

Including other metadata from the release commit does not seem useful to
me; how valuable is it to know the author and commit message of the last
commit that happened to make it into a release? It would be more useful
to know the SHA-1 of that commit, but that would presumably be included
elsewhere in the packaging data used by the distribution.

 [...]

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu



Re: weaning distributions off tarballs: extended verification of git tags

2015-03-02 Thread Duy Nguyen
On Tue, Mar 3, 2015 at 1:12 AM, Joey Hess i...@joeyh.name wrote:
 I support this proposal, as someone who no longer releases tarballs
 of my software, when I can possibly avoid it. I have worried about
 signed tags / commits only being a SHA1 break away from useless.

 As to the implementation, checksumming the collection of raw objects is
 certainly superior to tar. Colin had suggested sorting the objects by
 checksum, but I don't think that is necessary. Just stream the commit
 object, then its tree object, followed by the content of each object
 listed in the tree, recursing into subtrees as necessary. That will be a
 stable stream for a given commit, or tree.

It could be simplified a bit by using ls-tree -r (so you basically
have a single big tree). Then hash commit, ls-tree -r output and all
blobs pointed by ls-tree in listed order.
-- 
Duy


Re: weaning distributions off tarballs: extended verification of git tags

2015-03-02 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

 On Tue, Mar 3, 2015 at 1:12 AM, Joey Hess i...@joeyh.name wrote:
 I support this proposal, as someone who no longer releases tarballs
 of my software, when I can possibly avoid it. I have worried about
 signed tags / commits only being a SHA1 break away from useless.

 As to the implementation, checksumming the collection of raw objects is
 certainly superior to tar. Colin had suggested sorting the objects by
 checksum, but I don't think that is necessary. Just stream the commit
 object, then its tree object, followed by the content of each object
 listed in the tree, recursing into subtrees as necessary. That will be a
 stable stream for a given commit, or tree.

 It could be simplified a bit by using ls-tree -r (so you basically
 have a single big tree). Then hash commit, ls-tree -r output and all
 blobs pointed by ls-tree in listed order.

What problem are you trying to solve here, though, by deliberately
deviating from what Git internally uses to store these objects?  If it is
OK to ignore the tree boundary, then you probably do not even need
trees in this secondary hash for validation in the first place.

For example, you can hash a stream:

commit object contents +
N * (pathname + NUL + blob object contents)

as long as the pathnames are sorted in a predictable order (like
in the index order) in the output.  That would be even simpler (I
am not saying it is necessarily better, and by inference neither is
your simplification).
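
As an illustration, here is a minimal sketch of the flat stream described above. It assumes the commit object contents and the blobs are already available as bytes. The function name `flat_stream_digest`, the in-memory `blobs` mapping, the sample data, and the choice of SHA-512 are hypothetical and not part of any proposal in this thread. Plain byte-wise sorting of path names stands in for "a predictable order".

```python
import hashlib
from typing import Mapping


def flat_stream_digest(commit_contents: bytes, blobs: Mapping[bytes, bytes]) -> str:
    """Hash commit contents followed by (pathname + NUL + blob contents) entries."""
    h = hashlib.sha512()  # assumption: any strong hash could be used here
    h.update(commit_contents)
    # Sort path names byte-wise so the stream does not depend on input order.
    for path in sorted(blobs):
        h.update(path + b"\0" + blobs[path])
    return h.hexdigest()


# Hypothetical sample data, for illustration only.
commit = b"(commit object contents would go here)\n"
files = {
    b"src/main.c": b"int main(void) { return 0; }\n",
    b"README": b"hello\n",
}
print(flat_stream_digest(commit, files))
```

Note that this stream carries neither object lengths nor file modes, so it is only as strong as those omissions allow.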

I was about to suggest another alternative.

Pretend as if Git internally used SHA-512 (or whatever hash you
want to use) instead of SHA-1, compute the object names that
way.  Recompute the contents of a tree object by replacing
the 20-byte SHA-1 field in it with a field of whatever
length is necessary to hold the longer object names of elements in
the tree.

But then a realization hit me: what new value will be placed in the
"parent" field in the commit object?  You cannot have a SHA-512
variant of the commit object name without recomputing the whole history.

Now, if the final objective is to replace signature of tarballs,
does it matter to cover the commit object, or is it sufficient to
cover the tree contents?
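
To make the tree-only variant of this alternative concrete, here is a hedged sketch that recomputes object names over an in-memory tree with an arbitrary hash. The nested-dict representation, the helper names, and the sample data are assumptions for illustration. The encoding follows Git's object format: a "type SP length NUL" header, and tree entries of "mode SP name NUL" followed by the raw binary object name, with directories sorted as if their name ended in "/". With a longer hash, each entry simply carries a longer name in place of the 20-byte SHA-1.

```python
import hashlib


def object_name(kind: bytes, body: bytes, algo: str) -> bytes:
    """Raw digest of a Git-style object: kind + SP + length + NUL + body."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.new(algo, header + body).digest()


def tree_name(tree: dict, algo: str) -> bytes:
    """Recursively compute a tree's object name, using `algo` at every level.

    `tree` maps entry name -> (mode, blob bytes or nested tree dict).
    """
    def sort_key(item):
        name, (mode, _) = item
        # Git orders directory entries as if the name had a trailing slash.
        return name + b"/" if mode == b"40000" else name

    body = b""
    for name, (mode, child) in sorted(tree.items(), key=sort_key):
        if mode == b"40000":
            child_name = tree_name(child, algo)
        else:
            child_name = object_name(b"blob", child, algo)
        body += mode + b" " + name + b"\0" + child_name
    return object_name(b"tree", body, algo)


# Hypothetical sample tree, for illustration only.
example = {
    b"README": (b"100644", b"hello\n"),
    b"src": (b"40000", {b"main.c": (b"100644", b"int main(void) { return 0; }\n")}),
}
print(tree_name(example, "sha1").hex())    # 20-byte name
print(tree_name(example, "sha512").hex())  # 64-byte name over the same content
```

Because only trees and blobs are involved, nothing here needs the "parent" field, which is why the tree-only form sidesteps the history problem.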

Among the ideas raised so far, I like what Joey suggested, combined
with "each should have a 'type length NUL' header" from Sam Vilain,
the best.  That is, hash the stream:

commit length NUL + commit object contents +
tree length NUL + top level tree contents +
... list the entries in the order you would find by
... some defined traversal order people can agree on.

with whatever the preferred strong hash function of the age.
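
A minimal sketch of such a stream, assuming the caller has already collected the objects as (type, contents) pairs in whatever traversal order ends up being agreed on. The function name `object_stream_digest`, the sample objects, and the SHA-512 default are hypothetical; the thread has not settled on any of them.

```python
import hashlib
from typing import Iterable, Tuple


def object_stream_digest(objects: Iterable[Tuple[bytes, bytes]],
                         algo: str = "sha512") -> str:
    """Hash a sequence of objects, each prefixed by 'type SP length NUL'."""
    h = hashlib.new(algo)
    for kind, contents in objects:
        h.update(kind + b" " + str(len(contents)).encode() + b"\0")
        h.update(contents)
    return h.hexdigest()


# Hypothetical placeholder objects; real input would be raw Git object contents.
stream = [
    (b"commit", b"(commit object contents)\n"),
    (b"tree", b"(top level tree contents)"),
    (b"blob", b"hello\n"),
]
print(object_stream_digest(stream))
```

The traversal order is deliberately left to the caller, since that is the part still to be defined.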


Re: weaning distributions off tarballs: extended verification of git tags

2015-03-02 Thread Duy Nguyen
On Tue, Mar 3, 2015 at 6:44 AM, Junio C Hamano gits...@pobox.com wrote:
 Duy Nguyen pclo...@gmail.com writes:

 On Tue, Mar 3, 2015 at 1:12 AM, Joey Hess i...@joeyh.name wrote:
 I support this proposal, as someone who no longer releases tarballs
 of my software, when I can possibly avoid it. I have worried about
 signed tags / commits only being a SHA1 break away from useless.

 As to the implementation, checksumming the collection of raw objects is
 certainly superior to tar. Colin had suggested sorting the objects by
 checksum, but I don't think that is necessary. Just stream the commit
 object, then its tree object, followed by the content of each object
 listed in the tree, recursing into subtrees as necessary. That will be a
 stable stream for a given commit, or tree.

 It could be simplified a bit by using ls-tree -r (so you basically
 have a single big tree). Then hash commit, ls-tree -r output and all
 blobs pointed by ls-tree in listed order.

 What problem are you trying to solve here, though, by deliberately
 deviating from what Git internally uses to store these objects?  If it is
 OK to ignore the tree boundary, then you probably do not even need
 trees in this secondary hash for validation in the first place.

 For example, you can hash a stream:

 commit object contents +
 N * (pathname + NUL + blob object contents)

 as long as the pathnames are sorted in a predictable order (like
 in the index order) in the output.  That would be even simpler (I
 am not saying it is necessarily better, and by inference neither is
 your simplification).

I did nearly that [1]. But this morning I realized trees carry file
permission. We should keep that in the final checksum as well.
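
A small sketch of why the mode matters, under the assumption that entries are (mode, path, contents) byte triples held in memory. The `digest` helper and the sample script are hypothetical. Without the mode in the stream, flipping the executable bit leaves the checksum unchanged.

```python
import hashlib


def digest(entries, with_mode: bool) -> str:
    """Hash (mode, path, contents) triples in path order, optionally with mode."""
    h = hashlib.sha512()
    for mode, path, contents in sorted(entries, key=lambda e: e[1]):
        if with_mode:
            h.update(mode + b" ")
        h.update(path + b"\0" + contents)
    return h.hexdigest()


# Same file contents, differing only in the executable bit.
script = b"#!/bin/sh\necho hi\n"
plain = [(b"100644", b"run.sh", script)]
executable = [(b"100755", b"run.sh", script)]

print(digest(plain, False) == digest(executable, False))  # True: mode change is invisible
print(digest(plain, True) == digest(executable, True))    # False: mode change is detected
```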

 Now, if the final objective is to replace signature of tarballs,
 does it matter to cover the commit object, or is it sufficient to
 cover the tree contents?

 Among the ideas raised so far, I like what Joey suggested, combined
 with "each should have a 'type length NUL' header" from Sam Vilain,
 the best.  That is, hash the stream:

 commit length NUL + commit object contents +
 tree length NUL + top level tree contents +
 ... list the entries in the order you would find by
 ... some defined traversal order people can agree on.

 with whatever the preferred strong hash function of the age.

A bit harder to script, but simpler to provide from cat-file, I think.

[1] http://article.gmane.org/gmane.comp.version-control.git/260211
-- 
Duy


Re: weaning distributions off tarballs: extended verification of git tags

2015-03-02 Thread Sam Vilain

On 03/02/2015 12:08 PM, Junio C Hamano wrote:

I have a
hazy recollection of what it would take to replace SHA-1 in git with
something else; it should be possible (though tricky) to do it lazily,
where a tree entry has bits (eg, some of the currently unused file
mode bits) to denote which hash algorithm is in use for the entry.
However I don't think that got past idea stage...

I think one reason why it didn't was because it would not work well.
That bit that tells "this is a new object or old" would mean that a
single tree can have many different object names, depending on which
of its component entries are using that bit and which aren't.  There
goes the "we know two trees with the same object name are identical
without recursing into them" optimization out the window.

Also it would make it impossible to do what you suggest to Joey to
do, i.e. "exactly the same way that git does", once you start saying
that a tree object can be encoded in more than one way,
wouldn't it?


I was reasoning that people would rather not have to rewrite their whole 
history in order to switch checksum algorithms, and that allowing 
trees to be lazily converted would make things more 
efficient.  However, I think I see your point here that this doesn't work.


However, with a per-commit header, only the first commit that changes 
the hashing algorithm would have to re-checksum each of the files, and 
only those in the current tree, not all the way back to the beginning of 
history.  The delta logic should not have to care, and these objects 
with the same content but different object ID should pack perfectly, so 
long as git-pack-objects knows to re-checksum objects with the available 
hash algorithms and spot matches.


Other operations like diff which span commit hashing algorithms might be 
able to get away with their existing object ranking algorithms and cache 
alternate object IDs for content as they operate to facilitate exact 
matching across hash algorithm changes.


But actually, for the original problem - just producing a signature with 
a different hashing algorithm - probably it would be sufficient to just 
re-hash the current commit and the current tree recursively, and the 
mixed hash-algorithm case does not need to exist.  But I'm just thinking 
it might not be too hard to make git nicely generic, to be well prepared 
for when a second pre-image attack on SHA-1 becomes practical.


Sam


Re: weaning distributions off tarballs: extended verification of git tags

2015-03-02 Thread Colin Walters
On Sat, Feb 28, 2015, at 03:34 PM, Morten Welinder wrote:
 Is there a point to including a different checksum inside
 a git tag?  If someone can break the SHA-1 checksum
 in the repository then the recorded SHA-256 checksum can
 be changed.  In other words, wouldn't you be just as well
 off handing someone a SHA-1 commit id?

The issue is more about what the checksum covers, as
well as its strength.  Git uses a hash tree, which means
that an attacker only has to find a collision for *one* of
the objects, and the signature is still valid.  And that collision
is valid for *every* commit that contains that object.

This topic has been covered elsewhere pretty extensively,
here's a link:
https://www.whonix.org/forum/index.php/topic,538.msg4278.html#msg4278

Now I think rough consensus is still that git is secure or
secure enough - but with this proposal I'm just trying
to overcome the remaining conservatism.  (Also, while those
discussions were focusing on corrupting an existing repository,
the attack model of MITM also exists, and there
you don't have to worry about deltas, particularly if the
attacker's goal is to get a downstream to do a build
and thus execute their hostile code inside the downstream
network).

It's really not that expensive to do once per release,
basically free for small repositories, and for a large one like
the Linux kernel:

$ cd ~/src/linux
$ git describe
v3.19-7478-g796e1c5
$ time /bin/sh -c 'git archive --format=tar HEAD|sha256sum'
4a5c5826cea188abd52fa50c663d17ebe1dfe531109fed4ddbd765a856f1966e  -

real    0m3.772s
user    0m6.132s
sys     0m0.279s
$

With this proposal, the checksum
covers an entire stream of objects for a given commit at once,
making it significantly harder to find a collision.  That is at least as good as 
checksummed tarballs, and arguably better since it's
pre-compression.

So to implement this, perhaps something like:

$ git archive --format=raw

as a base primitive, and:

$ git tag --archive-raw-checksum=SHA256 -s -m ...

?

git fsck could also learn to optionally use this.


Re: weaning distributions off tarballs: extended verification of git tags

2015-03-02 Thread Joey Hess
I support this proposal, as someone who no longer releases tarballs
of my software, when I can possibly avoid it. I have worried about
signed tags / commits only being a SHA1 break away from useless.

As to the implementation, checksumming the collection of raw objects is
certainly superior to tar. Colin had suggested sorting the objects by
checksum, but I don't think that is necessary. Just stream the commit
object, then its tree object, followed by the content of each object
listed in the tree, recursing into subtrees as necessary. That will be a
stable stream for a given commit, or tree.

-- 
see shy jo




Re: weaning distributions off tarballs: extended verification of git tags

2015-03-02 Thread Sam Vilain

On 03/02/2015 10:12 AM, Joey Hess wrote:

I support this proposal, as someone who no longer releases tarballs
of my software, when I can possibly avoid it. I have worried about
signed tags / commits only being a SHA1 break away from useless.

As to the implementation, checksumming the collection of raw objects is
certainly superior to tar. Colin had suggested sorting the objects by
checksum, but I don't think that is necessary. Just stream the commit
object, then its tree object, followed by the content of each object
listed in the tree, recursing into subtrees as necessary. That will be a
stable stream for a given commit, or tree.


I would really just do it exactly the same way that git does: checksum 
the objects including their headers with the new hashes.  I have a hazy 
recollection of what it would take to replace SHA-1 in git with 
something else; it should be possible (though tricky) to do it lazily, 
where a tree entry has bits (eg, some of the currently unused file mode 
bits) to denote which hash algorithm is in use for the entry.  However 
I don't think that got past idea stage...


Sam


Re: weaning distributions off tarballs: extended verification of git tags

2015-03-02 Thread Junio C Hamano
Sam Vilain s...@vilain.net writes:

 As to the implementation, checksumming the collection of raw objects is
 certainly superior to tar. Colin had suggested sorting the objects by
 checksum, but I don't think that is necessary. Just stream the commit
 object, then its tree object, followed by the content of each object
 listed in the tree, recursing into subtrees as necessary. That will be a
 stable stream for a given commit, or tree.

 I would really just do it exactly the same way that git does: checksum
 the objects including their headers with the new hashes.

I tend to agree that it is a good idea.  I also suspect that would
make the implementation simpler by allowing it to share more code,
but I didn't look into it too deeply.

 I have a
 hazy recollection of what it would take to replace SHA-1 in git with
 something else; it should be possible (though tricky) to do it lazily,
 where a tree entry has bits (eg, some of the currently unused file
 mode bits) to denote which hash algorithm is in use for the entry.
 However I don't think that got past idea stage...

I think one reason why it didn't was because it would not work well.
That bit that tells "this is a new object or old" would mean that a
single tree can have many different object names, depending on which
of its component entries are using that bit and which aren't.  There
goes the "we know two trees with the same object name are identical
without recursing into them" optimization out the window.

Also it would make it impossible to do what you suggest to Joey to
do, i.e. "exactly the same way that git does", once you start saying
that a tree object can be encoded in more than one way,
wouldn't it?



weaning distributions off tarballs: extended verification of git tags

2015-02-28 Thread Colin Walters
Hi, 

TL;DR: Let's define a standard for embedding stronger checksums in tags and 
commit messages:
https://github.com/cgwalters/homegit/blob/master/bin/git-evtag

I think tarballs should go away as a source distribution mechanism in favor of 
pure git.  I won't go into too many details of the why here (hopefully most 
of you agree!) but that's the background.

Now, there are a few things that the classical tarball model provides:

- Version numbers compatible with dpkg/rpm/etc
  - Do the same with your tag names, and use a well known scheme like 
v$VERSION
- The assumption that this source has been run through some tests
  - Broken assumption, and regardless you want to rerun tests downstream
- Hosting providers typically offer a strong checksum over the entire source
  - The topic of this post

The above strawman code allows embedding the SHA256(git archive | tar).  Now,
in order to make this work, the byte output of git archive must never change
in the future.  I'm not sure how valid an assumption this is.  Timestamps are
set to the commit timestamp, but I could imagine someone wanting to come along
later and tweak the output to be compatible with some variant of tar or
something.

We could define the checksum to be over the stream of raw objects, sorted by
their checksum, and that way be independent of archiving format variations.

Is there agreement that something like this makes sense in the git core?  Does
the concept make sense?  Does anything like this exist today?  Other
thoughts/objections?


Re: weaning distributions off tarballs: extended verification of git tags

2015-02-28 Thread Morten Welinder
Is there a point to including a different checksum inside
a git tag?  If someone can break the SHA-1 checksum
in the repository then the recorded SHA-256 checksum can
be changed.  In other words, wouldn't you be just as well
off handing someone a SHA-1 commit id?

If you can guard the SHA-256 with a signature, you can
do the same thing to the SHA-1.  Or the tarball for that matter.

Unrelatedly, your assumptions:

Tar balls have too many degrees of freedom to rely on them
being created identically in the future.

 - The assumption that this source has been run through some tests

A perfectly valid assumption for some build systems, notably
autotools.  make distcheck is the only way my tarballs get
made and they only get made when the checks succeed.
(If your point was that many projects have too few tests,
well, then I agree.)

M.



On Sat, Feb 28, 2015 at 9:48 AM, Colin Walters walt...@verbum.org wrote:
 Hi,

 TL;DR: Let's define a standard for embedding stronger checksums in tags and 
 commit messages:
 https://github.com/cgwalters/homegit/blob/master/bin/git-evtag

 I think tarballs should go away as a source distribution mechanism in favor 
 of pure git.  I won't go into too many details of the why here (hopefully 
 most of you agree!) but that's the background.

 Now, there are a few things that the classical tarball model provides:

 - Version numbers compatible with dpkg/rpm/etc
   - Do the same with your tag names, and use a well known scheme like 
 v$VERSION
 - The assumption that this source has been run through some tests
   - Broken assumption, and regardless you want to rerun tests downstream
 - Hosting providers typically offer a strong checksum over the entire source
   - The topic of this post

 The above strawman code allows embedding the SHA256(git archive | tar).  Now,
 in order to make this work, the byte output of git archive must never 
 change in the
 future.  I'm not sure how valid an assumption this is.  Timestamps are set to 
 the
 commit timestamp, but I could imagine someone wanting to come along later
 and tweak the output to be compatible with some variant of tar or something.

 We could define the checksum to be over the stream of raw objects, sorted by 
 their checksum,
 and that way be independent of archiving format variations.

 Is there agreement that something like this makes sense in the git core?  
 Does the
 concept make sense?  Does anything like this exist today?  Other 
 thoughts/objections?


Re: weaning distributions off tarballs: extended verification of git tags

2015-02-28 Thread brian m. carlson

On Sat, Feb 28, 2015 at 09:48:05AM -0500, Colin Walters wrote:

The above strawman code allows embedding the SHA256(git archive | tar).  Now,
in order to make this work, the byte output of git archive must never change
in the future.  I'm not sure how valid an assumption this is.  Timestamps are
set to the commit timestamp, but I could imagine someone wanting to come along
later and tweak the output to be compatible with some variant of tar or
something.


This is not a safe assumption.  Unfortunately, kernel.org assumed that 
it was the case, and a change broke it.  Let's please not make more code 
that does that.



We could define the checksum to be over the stream of raw objects, sorted by
their checksum, and that way be independent of archiving format variations.


This would be a much better idea, assuming you mean raw git objects. 
For cryptographic purposes, it's important to make the item boundaries 
unambiguous, which is usually done using the length.  Since the raw git 
objects include the length, this is sufficient.


If you don't make the boundaries unambiguous, you get the problem you 
have with v3 OpenPGP keys, where somebody could move bytes from one 
value to another, creating a different key, but with the same 
fingerprint (hash value).
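
The boundary problem can be shown with a short sketch. The helper names and the decimal-length-plus-NUL framing are assumptions chosen for illustration (similar in spirit to the length carried in a raw git object header); SHA-256 is used only as an example hash.

```python
import hashlib


def naive_digest(items):
    """Hash items by plain concatenation, with no boundaries."""
    h = hashlib.sha256()
    for item in items:
        h.update(item)
    return h.hexdigest()


def framed_digest(items):
    """Hash items with each one prefixed by its decimal length and a NUL."""
    h = hashlib.sha256()
    for item in items:
        h.update(str(len(item)).encode() + b"\0" + item)
    return h.hexdigest()


# Two different item lists whose concatenations are byte-identical.
a = [b"abc", b"def"]
b = [b"ab", b"cdef"]

print(naive_digest(a) == naive_digest(b))    # True: bytes moved across a boundary go unnoticed
print(framed_digest(a) == framed_digest(b))  # False: the length prefix pins each boundary
```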

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

