Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-19 Thread Johannes Schindelin
Hi Peff,

On Fri, 16 Jun 2017, Jeff King wrote:

> On Fri, Jun 16, 2017 at 03:24:19PM +0200, Johannes Schindelin wrote:
> 
> > I have no doubt that Visual Studio Team Services, GitHub and Atlassian
> > will eventually end up with FPGAs for hash computation. So that's
> > that.
> 
> I actually doubt this from the GitHub side. Hash performance is not even
> on our radar as a bottleneck. In most cases the problem is touching
> uncompressed data _at all_, not computing the hash over it (so things
> like reusing on-disk deltas are really important).

Thanks for pointing that out! As a mainly client-side person, I rarely get
insights into the server side...

Ciao,
Dscho


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Jun 16 2017, Jonathan Nieder jotted:
> Part of the reason I suggested previously that it would be helpful to
> try to benchmark Git with various hash functions (which didn't go over
> well, for some reason) is that it makes these comparisons more
> concrete.  Without measuring, it is hard to get a sense of the
> distribution of input sizes and how much practical effect the
> differences we are talking about have.

It would be great to have such benchmarks (I probably missed the "didn't
go over well" part), but FWIW you can get pretty close to this right now
in git by running various t/perf benchmarks with
BLK_SHA1/OPENSSL/SHA1DC.

Between the three of those (particularly SHA1DC being slower than
OpenSSL) you get a performance difference similar to some SHA-1
vs. SHA-256 benchmarks I've seen, so to the extent that we have
existing performance tests it's revealing to see what's slower & faster.

It makes a particularly big difference for e.g. p3400-rebase.sh.
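
For instance (a sketch; GIT_PERF_MAKE_OPTS is the knob t/perf uses to
pass extra options to make when building the trees under test, and the
exact invocation may need tweaking for your setup):

    # compare two SHA-1 backends on the rebase perf test
    cd t/perf
    GIT_PERF_MAKE_OPTS='BLK_SHA1=1' ./run HEAD -- p3400-rebase.sh
    GIT_PERF_MAKE_OPTS='DC_SHA1=1'  ./run HEAD -- p3400-rebase.sh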


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-16 Thread Jonathan Nieder
Junio C Hamano wrote:
> Junio C Hamano writes:
>> Adam Langley writes:

>>> However, as I'm not a git developer, I've no opinion on whether the
>>> cost of carrying implementations of these functions is worth the speed
>>> vs using SHA-256, which can be assumed to be supported everywhere
>>> already.
>>
>> Thanks.
>>
>> My impression from this thread is that even though fast may be
>> better than slow, ubiquity trumps it for our use case, as long as
>> the thing is not absurdly and unusably slow, of course.  Which makes
>> me lean towards something older/more established like SHA-256, and
>> it would be a very nice bonus if it gets hardware acceleration more
>> widely than others ;-)
>
> Ah, I recall one thing that was mentioned but not discussed much in
> the thread: possible use of tree-hashing to exploit multiple cores
> hashing a large-ish payload.  As long as it is OK to pick a sound
> tree hash coding on top of any (secure) underlying hash function,
> I do not think the use of tree-hashing should affect which exact
> underlying hash function is to be used, and I also am not convinced
> that we really want tree hashing (some codepaths that deal with a large
> payload want to stream the data in a single pass from head to tail)
> in the context of Git, but I am not a crypto person, so ...

Tree hashing also affects single-core performance because of the
availability of SIMD instructions.

That is how software implementations of e.g. blake2bp-256 and
SHA-256x16[1] achieve performance competitive with (and in some cases
slightly better than) hardware implementations of SHA-256.
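
To make the leaf/root idea concrete, here is a toy sketch in Python
(purely illustrative, not any standardized construction; real designs
such as BLAKE2bp also domain-separate leaf and root hashing and fix
the leaf count):

    import hashlib

    def toy_tree_hash(data: bytes, leaves: int = 4) -> bytes:
        # Split the input into independent leaves; SIMD lanes or
        # multiple cores can hash all leaves at the same time.
        chunk = max(1, (len(data) + leaves - 1) // leaves)
        digests = [hashlib.sha256(data[i:i + chunk]).digest()
                   for i in range(0, max(1, len(data)), chunk)]
        # The root hashes the short concatenation of leaf digests,
        # so the serial part of the computation stays tiny.
        return hashlib.sha256(b"".join(digests)).digest()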

It is also satisfying that we have options like these that are faster
than SHA-1.

All that said, SHA-256 seems like a fine choice, despite its worse
performance.  The wide availability of reasonable-quality
implementations (e.g. in Java you can use
'MessageDigest.getInstance("SHA-256")') makes it a very tempting one.

Part of the reason I suggested previously that it would be helpful to
try to benchmark Git with various hash functions (which didn't go over
well, for some reason) is that it makes these comparisons more
concrete.  Without measuring, it is hard to get a sense of the
distribution of input sizes and how much practical effect the
differences we are talking about have.

Thanks,
Jonathan

[1] https://eprint.iacr.org/2012/476.pdf


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-16 Thread Junio C Hamano
Junio C Hamano writes:

> Adam Langley writes:
>
>> However, as I'm not a git developer, I've no opinion on whether the
>> cost of carrying implementations of these functions is worth the speed
>> vs using SHA-256, which can be assumed to be supported everywhere
>> already.
>
> Thanks.
>
> My impression from this thread is that even though fast may be
> better than slow, ubiquity trumps it for our use case, as long as
> the thing is not absurdly and unusably slow, of course.  Which makes
> me lean towards something older/more established like SHA-256, and
> it would be a very nice bonus if it gets hardware acceleration more
> widely than others ;-)

Ah, I recall one thing that was mentioned but not discussed much in
the thread: possible use of tree-hashing to exploit multiple cores
hashing a large-ish payload.  As long as it is OK to pick a sound
tree hash coding on top of any (secure) underlying hash function,
I do not think the use of tree-hashing should affect which exact
underlying hash function is to be used, and I also am not convinced
that we really want tree hashing (some codepaths that deal with a large
payload want to stream the data in a single pass from head to tail)
in the context of Git, but I am not a crypto person, so ...




Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-16 Thread Junio C Hamano
Adam Langley writes:

> However, as I'm not a git developer, I've no opinion on whether the
> cost of carrying implementations of these functions is worth the speed
> vs using SHA-256, which can be assumed to be supported everywhere
> already.

Thanks.

My impression from this thread is that even though fast may be
better than slow, ubiquity trumps it for our use case, as long as
the thing is not absurdly and unusably slow, of course.  Which makes
me lean towards something older/more established like SHA-256, and
it would be a very nice bonus if it gets hardware acceleration more
widely than others ;-)



Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-16 Thread Jeff King
On Fri, Jun 16, 2017 at 03:24:19PM +0200, Johannes Schindelin wrote:

> I have no doubt that Visual Studio Team Services, GitHub and Atlassian
> will eventually end up with FPGAs for hash computation. So that's that.

I actually doubt this from the GitHub side. Hash performance is not even
on our radar as a bottleneck. In most cases the problem is touching
uncompressed data _at all_, not computing the hash over it (so things
like reusing on-disk deltas are really important).

-Peff


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-16 Thread Adam Langley
On Fri, Jun 16, 2017 at 6:24 AM, Johannes Schindelin wrote:
>
> And while I am really thankful that Adam chimed in, I think he would agree
> that BLAKE2 is a purposefully weakened version of BLAKE, for the benefit
> of speed

That is correct.

It is worth keeping in mind, though, that the analysis results from the
SHA-3 process informed this rebalancing. Indeed, NIST proposed[1] to
do the same with Keccak before stamping it as SHA-3 (although it
ultimately did not, given the climate of public opinion in late 2013).
The Keccak team have essentially done the same with K12. Thus there is
evidence of a fairly widespread belief that the SHA-3 parameters were
excessively cautious.

[1] https://docs.google.com/file/d/0BzRYQSHuuMYOQXdHWkRiZXlURVE/edit, slide 48

> (with the caveat that one of my experts disagrees that BLAKE2b
> would be faster than hardware-accelerated SHA-256).

The numbers given above for SHA-256 on Ryzen and Cortex-A72 must be
with hardware acceleration and I thank Brian Carlson for digging them
up as I hadn't seen them before.

I suggested above that BLAKE2bp (note the p at the end) might be
faster than hardware SHA-256 and that appears to be plausible based on
benchmarks[2] of that function. (With the caveat those numbers are for
Haswell and Skylake and so cannot be directly compared with Ryzen.)

K12 reports similar speeds on Skylake[3] and thus is also plausibly
faster than hardware SHA-256.

[2] https://github.com/sneves/blake2-avx2
[3] http://keccak.noekeon.org/KangarooTwelve.pdf

However, as I'm not a git developer, I've no opinion on whether the
cost of carrying implementations of these functions is worth the speed
vs using SHA-256, which can be assumed to be supported everywhere
already.


Cheers

AGL


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-16 Thread Johannes Schindelin
Hi,

On Fri, 16 Jun 2017, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Jun 16 2017, brian m. carlson jotted:
> 
> > On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >
> >> So I don't follow the argument that we shouldn't weigh future HW
> >> acceleration highly just because you can't easily buy a laptop today
> >> with these features.
> >>
> >> Aside from that I think you've got this backwards, it's AMD that's
> >> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
> >> starting at the lower end this year with Goldmont which'll be in
> >> lower-end consumer devices[2]. If you read the github issue I linked
> >> to upthread[3] you can see that the cryptopp devs already tested
> >> their SHA accelerated code on a consumer Celeron[4] recently.
> >>
> >> I don't think Intel has announced the SHA extensions for future Xeon
> >> releases, but it seems a given that they're going to have them there
> >> as well. Have there ever been x86 extensions that didn't eventually
> >> become available across the entire line, or that were removed from
> >> x86 once introduced?
> >>
> >> In any case, I think by the time we're ready to follow-up the current
> >> hash refactoring efforts with actually changing the hash
> >> implementation many of us are likely to have laptops with these
> >> extensions, making this easy to test.
> >
> > I think you underestimate the life of hardware and software.  I have
> > servers running KVM development instances that have been running since
> > at least 2012.  Those machines are not scheduled for replacement
> > anytime soon.
> >
> > Whatever we deploy within the next year is going to run on existing
> > hardware for probably a decade, whether we want it to or not.  Most of
> > those machines don't have acceleration.
> 
> To clarify, I'm not dismissing the need to consider existing hardware
> without these acceleration functions or future processors without them.
> I don't think that makes any sense, we need to keep those in mind.
> 
> I was replying to a bit in your comment where you (it seems to me) were
> making the claim that we shouldn't consider the HW acceleration of
> certain hash functions either.

Yes, I also had the impression that it stressed the status quo quite a bit
too much.

We know for a fact that SHA-256 acceleration is coming to consumer CPUs.
We know of no plans to hardware-accelerate any of the other mentioned
hash functions in consumer CPUs.

And remember: for those who are affected most (humongous monorepos, source
code hosters), upgrading hardware is less of an issue than having a secure
hash function for the rest of us.

And while I am really thankful that Adam chimed in, I think he would agree
that BLAKE2 is a purposefully weakened version of BLAKE, for the benefit
of speed (with the caveat that one of my experts disagrees that BLAKE2b
would be faster than hardware-accelerated SHA-256). And while BLAKE has
seen roughly as much cryptanalysis as Keccak (which became SHA-3),
BLAKE2 has not.

That makes me *very* uneasy about choosing BLAKE2.

> > Furthermore, you need a reasonably modern crypto library to get hardware
> > acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
> > does not currently support it, and probably never will.  That OS is
> > going to be around for the next 6 years.
> >
> > If we're optimizing for performance, I don't want to optimize for the
> > latest, greatest machines.  Those machines are going to outperform
> > everything else either way.  I'd rather optimize for something which
> > performs well on the whole everywhere.  There are a lot of developers
> > who have older machines, for cost reasons or otherwise.
> 
> We have real data showing that the intersection between people who care
> about the hash slowing down and those who can't afford the latest
> hardware is pretty much nil.
> 
> I.e. in 2.13.0 SHA-1 got slower, and pretty much nobody noticed or cared
> except Johannes Schindelin, myself & Christian Couder. This is because
> in practice hashing only becomes a bottleneck on huge monorepos that
> need to e.g. re-hash the contents of a huge index.

Indeed. I am still concerned about that. As you mention, though, it really
only affects users of ginormous monorepos, and of course source code
hosters.

The jury's still out on how much it impacts my colleagues, by the way.

I have no doubt that Visual Studio Team Services, GitHub and Atlassian
will eventually end up with FPGAs for hash computation. So that's that.

Side note: BLAKE is actually *not* friendly to hardware acceleration, I
have been told by one cryptography expert. In contrast, the Keccak team
claims SHA3-256 to be the easiest to hardware-accelerate, making it "a
green cryptographic primitive":
http://keccak.noekeon.org/is_sha3_slow.html

> > Here are some stats (cycles/byte for long messages):
> >
> >                     SHA-256    BLAKE2b
> > Ryzen                  1.89       3.06
> > Knight's Landing      19.00       5.65
> > Cortex-A72             1.99       5.48
> > Cortex-A57            11.81       5.47
> > Cortex-A7             28.19      15.16

Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Jun 16 2017, brian m. carlson jotted:

> On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson wrote:
>> > SHA-256 acceleration exists for some existing Intel platforms already.
>> > However, they're not practically present on anything but servers at the
>> > moment, and so I don't think the acceleration of SHA-256 is
>> > something we should consider.
>>
>> Whatever next-gen hash Git ends up with is going to be in use for
>> decades, so what hardware acceleration exists in consumer products
>> right now is practically irrelevant, but what acceleration is likely
>> to exist for the lifetime of the hash existing *is* relevant.
>
> The life of MD5 was about 23 years (introduction to first document
> collision).  SHA-1 had about 22.  Decades, yes, but just barely.  SHA-2
> was introduced in 2001, and by the same estimate, we're a little over
> halfway through its life.

I'm talking about the lifetime of SHA-1 or $newhash's use in Git. As our
continued use of SHA-1 demonstrates, the window of practical hash
function use extends well beyond the window from introduction to
published breakage.

It's also telling that SHA-1, which any cryptographer would have waved
you off from since around 2011, is just getting widely deployed HW
acceleration now in 2017. The practical use of hash functions far
exceeds their recommended use in new projects.

>> So I don't follow the argument that we shouldn't weigh future HW
>> acceleration highly just because you can't easily buy a laptop today
>> with these features.
>>
>> Aside from that I think you've got this backwards, it's AMD that's
>> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
>> starting at the lower end this year with Goldmont which'll be in
>> lower-end consumer devices[2]. If you read the github issue I linked
>> to upthread[3] you can see that the cryptopp devs already tested their
>> SHA accelerated code on a consumer Celeron[4] recently.
>>
>> I don't think Intel has announced the SHA extensions for future Xeon
>> releases, but it seems a given that they're going to have them there
>> as well. Have there ever been x86 extensions that didn't eventually
>> become available across the entire line, or that were removed from
>> x86 once introduced?
>>
>> In any case, I think by the time we're ready to follow-up the current
>> hash refactoring efforts with actually changing the hash
>> implementation many of us are likely to have laptops with these
>> extensions, making this easy to test.
>
> I think you underestimate the life of hardware and software.  I have
> servers running KVM development instances that have been running since
> at least 2012.  Those machines are not scheduled for replacement anytime
> soon.
>
> Whatever we deploy within the next year is going to run on existing
> hardware for probably a decade, whether we want it to or not.  Most of
> those machines don't have acceleration.

To clarify, I'm not dismissing the need to consider existing hardware
without these acceleration functions or future processors without
them. I don't think that makes any sense, we need to keep those in mind.

I was replying to a bit in your comment where you (it seems to me) were
making the claim that we shouldn't consider the HW acceleration of
certain hash functions either.

Clearly both need to be considered.

> Furthermore, you need a reasonably modern crypto library to get hardware
> acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
> does not currently support it, and probably never will.  That OS is
> going to be around for the next 6 years.
>
> If we're optimizing for performance, I don't want to optimize for the
> latest, greatest machines.  Those machines are going to outperform
> everything else either way.  I'd rather optimize for something which
> performs well on the whole everywhere.  There are a lot of developers
> who have older machines, for cost reasons or otherwise.

We have real data showing that the intersection between people who care
about the hash slowing down and those who can't afford the latest
hardware is pretty much nil.

I.e. in 2.13.0 SHA-1 got slower, and pretty much nobody noticed or cared
except Johannes Schindelin, myself & Christian Couder. This is because
in practice hashing only becomes a bottleneck on huge monorepos that
need to e.g. re-hash the contents of a huge index.

> Here are some stats (cycles/byte for long messages):
>
>                    SHA-256    BLAKE2b
> Ryzen                 1.89       3.06
> Knight's Landing     19.00       5.65
> Cortex-A72            1.99       5.48
> Cortex-A57           11.81       5.47
> Cortex-A7            28.19      15.16
>
> In other words, BLAKE2b performs well uniformly across a wide variety of
> architectures even without acceleration.  I'd rather tell people that
> upgrading to a new hash algorithm is a performance win either way, not
> just if they have the latest hardware.

Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Jeff King
On Fri, Jun 16, 2017 at 06:10:22AM +0900, Mike Hommey wrote:

> > > What do the experts think of SHA512/256, which completely removes the
> > > concerns over length extension attack? (which I'd argue is better than
> > > sweeping them under the carpet)
> > 
> > I don't think it's sweeping them under the carpet. Git does not use the
> > hash as a MAC, so length extension attacks aren't a thing (and even if
> > we later wanted to use the same algorithm as a MAC, the HMAC
> > construction is a well-studied technique for dealing with it).
> 
> AIUI, length extension does make brute force collision attacks (which
> is really what Shattered was) cheaper by allowing one to create the
> collision with a small message and extend it later.
> 
> This might not be a credible threat against git, but if we go by that
> standard, post-Shattered SHA-1 is still fine for git. As a matter of
> fact, MD5 would also be fine: there is still, to this day, no preimage
> attack against either.

I think collision attacks are of interest to Git. But I would think
2^128 would be enough (TBH, 2^80 probably would have been enough for
SHA-1; it was the weaknesses that brought that down by a factor of a
million that made it a problem).

-Peff


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread brian m. carlson
On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson wrote:
> > SHA-256 acceleration exists for some existing Intel platforms already.
> > However, they're not practically present on anything but servers at the
> > moment, and so I don't think the acceleration of SHA-256 is
> > something we should consider.
> 
> Whatever next-gen hash Git ends up with is going to be in use for
> decades, so what hardware acceleration exists in consumer products
> right now is practically irrelevant, but what acceleration is likely
> to exist for the lifetime of the hash existing *is* relevant.

The life of MD5 was about 23 years (introduction to first document
collision).  SHA-1 had about 22.  Decades, yes, but just barely.  SHA-2
was introduced in 2001, and by the same estimate, we're a little over
halfway through its life.

> So I don't follow the argument that we shouldn't weigh future HW
> acceleration highly just because you can't easily buy a laptop today
> with these features.
> 
> Aside from that I think you've got this backwards, it's AMD that's
> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
> starting at the lower end this year with Goldmont which'll be in
> lower-end consumer devices[2]. If you read the github issue I linked
> to upthread[3] you can see that the cryptopp devs already tested their
> SHA accelerated code on a consumer Celeron[4] recently.
> 
> I don't think Intel has announced the SHA extensions for future Xeon
> releases, but it seems a given that they're going to have them there
> as well. Have there ever been x86 extensions that didn't eventually
> become available across the entire line, or that were removed from
> x86 once introduced?
> 
> In any case, I think by the time we're ready to follow-up the current
> hash refactoring efforts with actually changing the hash
> implementation many of us are likely to have laptops with these
> extensions, making this easy to test.

I think you underestimate the life of hardware and software.  I have
servers running KVM development instances that have been running since
at least 2012.  Those machines are not scheduled for replacement anytime
soon.

Whatever we deploy within the next year is going to run on existing
hardware for probably a decade, whether we want it to or not.  Most of
those machines don't have acceleration.

Furthermore, you need a reasonably modern crypto library to get hardware
acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
does not currently support it, and probably never will.  That OS is
going to be around for the next 6 years.

If we're optimizing for performance, I don't want to optimize for the
latest, greatest machines.  Those machines are going to outperform
everything else either way.  I'd rather optimize for something which
performs well on the whole everywhere.  There are a lot of developers
who have older machines, for cost reasons or otherwise.

Here are some stats (cycles/byte for long messages):

                   SHA-256    BLAKE2b
Ryzen                 1.89       3.06
Knight's Landing     19.00       5.65
Cortex-A72            1.99       5.48
Cortex-A57           11.81       5.47
Cortex-A7            28.19      15.16

In other words, BLAKE2b performs well uniformly across a wide variety of
architectures even without acceleration.  I'd rather tell people that
upgrading to a new hash algorithm is a performance win either way, not
just if they have the latest hardware.
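
(For intuition: cycles/byte converts to throughput as clock rate
divided by cpb. A quick conversion sketch in Python, assuming a nominal
3 GHz clock, which is only a ballpark and too high for the small ARM
cores:)

    # rough single-core throughput implied by the cycles/byte figures
    for name, cpb in [("SHA-256 on Ryzen", 1.89),
                      ("BLAKE2b on Ryzen", 3.06),
                      ("SHA-256 on Cortex-A7", 28.19)]:
        print("%s: %.2f GB/s at 3 GHz" % (name, 3e9 / cpb / 1e9))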
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204




Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Ævar Arnfjörð Bjarmason
On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson wrote:
> On Thu, Jun 15, 2017 at 02:59:57PM -0700, Adam Langley wrote:
>> (I was asked to comment a few points in public by Jonathan.)
>>
>> I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
>> K12, etc are all secure to the extent that I don't believe that making
>> comparisons between them on that axis is meaningful. Thus I think the
>> question is primarily concerned with performance and implementation
>> availability.
>>
>> I think any of the above would be reasonable choices. I don't believe
>> that length-extension is a concern here.
>>
>> SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
>> The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
>> my Ivy Bridge system, it's about 20%.
>>
>> (SHA-512/256 does not enjoy the same availability in common libraries, however.)
>>
>> Both Intel and ARM have SHA-256 instructions defined. I've not seen
>> good benchmarks of them yet, but they will make SHA-256 faster than
>> SHA-512 when available. However, it's very possible that something
>> like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
>> the ubiquity of SHA-256, but nor do you have to wait years for the CPU
>> population to advance for high performance.
>
> SHA-256 acceleration exists for some existing Intel platforms already.
> However, they're not practically present on anything but servers at the
> moment, and so I don't think the acceleration of SHA-256 is
> something we should consider.

Whatever next-gen hash Git ends up with is going to be in use for
decades, so what hardware acceleration exists in consumer products
right now is practically irrelevant, but what acceleration is likely
to exist for the lifetime of the hash existing *is* relevant.

So I don't follow the argument that we shouldn't weigh future HW
acceleration highly just because you can't easily buy a laptop today
with these features.

Aside from that I think you've got this backwards, it's AMD that's
adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
starting at the lower end this year with Goldmont which'll be in
lower-end consumer devices[2]. If you read the github issue I linked
to upthread[3] you can see that the cryptopp devs already tested their
SHA accelerated code on a consumer Celeron[4] recently.

I don't think Intel has announced the SHA extensions for future Xeon
releases, but it seems a given that they're going to have them there
as well. Have there ever been x86 extensions that didn't eventually
become available across the entire line, or that were removed from
x86 once introduced?

In any case, I think by the time we're ready to follow-up the current
hash refactoring efforts with actually changing the hash
implementation many of us are likely to have laptops with these
extensions, making this easy to test.

1. https://en.wikipedia.org/wiki/Intel_SHA_extensions
2. https://en.wikipedia.org/wiki/Goldmont
3. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385
4. https://ark.intel.com/products/95594/Intel-Celeron-Processor-J3455-2M-Cache-up-to-2_3-GHz


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread brian m. carlson
On Thu, Jun 15, 2017 at 02:59:57PM -0700, Adam Langley wrote:
> (I was asked to comment a few points in public by Jonathan.)
> 
> I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
> K12, etc are all secure to the extent that I don't believe that making
> comparisons between them on that axis is meaningful. Thus I think the
> question is primarily concerned with performance and implementation
> availability.
> 
> I think any of the above would be reasonable choices. I don't believe
> that length-extension is a concern here.
> 
> SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
> The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
> my Ivy Bridge system, it's about 20%.
> 
> (SHA-512/256 does not enjoy the same availability in common libraries, however.)
> 
> Both Intel and ARM have SHA-256 instructions defined. I've not seen
> good benchmarks of them yet, but they will make SHA-256 faster than
> SHA-512 when available. However, it's very possible that something
> like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
> the ubiquity of SHA-256, but nor do you have to wait years for the CPU
> population to advance for high performance.

SHA-256 acceleration exists for some existing Intel platforms already.
However, they're not practically present on anything but servers at the
moment, and so I don't think the acceleration of SHA-256 is
something we should consider.

The SUPERCOP benchmarks tell me that generally, on 64-bit systems where
acceleration is not available, SHA-256 is the slowest, followed by
SHA3-256.  BLAKE2b is the fastest.

If our goal is performance, then I would argue BLAKE2b-256 is the best
choice.  It is secure and extremely fast.  It does have the benefit that
we get to tell people that by moving away from SHA-1, they will get a
performance boost, pretty much no matter what the system.

BLAKE2bp may be faster, but it introduces additional implementation
complexity.  I'm not sure crypto libraries will implement it, but then
again, OpenSSL only implements BLAKE2b-512 at the moment.  I don't care
much either way, but we should add good tests to exercise the
implementation thoroughly.  We're generally going to need to ship our
own implementation anyway.
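
(As an aside, BLAKE2b is parameterized by digest length, so a 256-bit
variant is easy to experiment with wherever a library exposes that
knob, e.g. Python 3.6+'s hashlib; a sketch:)

    import hashlib

    # BLAKE2b with a 32-byte digest, i.e. BLAKE2b-256
    print(hashlib.blake2b(b"hello, world", digest_size=32).hexdigest())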

I've argued that SHA3-256 probably has the longest life and good
unaccelerated performance, and for that reason, I've preferred it.  But
if AGL says that they're all secure (and I generally think he knows
what he's talking about), we could consider performance more.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204




Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Adam Langley
(I was asked to comment a few points in public by Jonathan.)

I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
K12, etc are all secure to the extent that I don't believe that making
comparisons between them on that axis is meaningful. Thus I think the
question is primarily concerned with performance and implementation
availability.

I think any of the above would be reasonable choices. I don't believe
that length-extension is a concern here.

SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
my Ivy Bridge system, it's about 20%.

(SHA-512/256 does not enjoy the same availability in common libraries, however.)
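
(Anyone wanting to sanity-check the software-speed claim on their own
machine can do so with Python's OpenSSL-backed hashlib; a rough sketch,
not a rigorous benchmark:)

    import hashlib, timeit

    payload = b"\0" * (1 << 20)  # 1 MiB of input
    for name in ("sha256", "sha512"):
        t = timeit.timeit(lambda: hashlib.new(name, payload).digest(),
                          number=100)
        print(name, "%.0f MB/s" % (100 * len(payload) / t / 1e6))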

Both Intel and ARM have SHA-256 instructions defined. I've not seen
good benchmarks of them yet, but they will make SHA-256 faster than
SHA-512 when available. However, it's very possible that something
like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
the ubiquity of SHA-256, but nor do you have to wait years for the CPU
population to advance for high performance.

So, overall, none of these choices should obviously be excluded. The
considerations at this point are not cryptographic and the tradeoff
between implementation ease and performance is one that the git
community would have to make.


Cheers

AGL


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Mike Hommey
On Thu, Jun 15, 2017 at 09:01:45AM -0400, Jeff King wrote:
> On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
> 
> > On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> > > Footnote *1*: SHA-256, like all hash functions whose output is essentially
> > > the entire internal state, is susceptible to a so-called "length
> > > extension attack", where the hash of a secret+message can be used to
> > > generate the hash of secret+message+piggyback without knowing the secret.
> > > This is not the case for Git: only visible data are hashed. The type of
> > > attacks Git has to worry about is very different from the length extension
> > > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> > > say, a collision attack.
> > 
> > What do the experts think of SHA512/256, which completely removes the
> > concerns over length extension attack? (which I'd argue is better than
> > sweeping them under the carpet)
> 
> I don't think it's sweeping them under the carpet. Git does not use the
> hash as a MAC, so length extension attacks aren't a thing (and even if
> we later wanted to use the same algorithm as a MAC, the HMAC
> construction is a well-studied technique for dealing with it).

AIUI, length extension does make brute force collision attacks (which
is really what Shattered was) cheaper by allowing one to create the
collision with a small message and extend it later.

This might not be a credible threat against git, but if we go by that
standard, post-Shattered SHA-1 is still fine for git. As a matter of
fact, MD5 would also be fine: there is still, to this day, no preimage
attack against either.

Mike


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Johannes Schindelin
Hi,

On Thu, 15 Jun 2017, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Jun 15 2017, Jeff King jotted:
> 
> > On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
> >
> >> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> >>
> >> > Footnote *1*: SHA-256, like all hash functions whose output is
> >> > essentially the entire internal state, is susceptible to a
> >> > so-called "length extension attack", where the hash of a
> >> > secret+message can be used to generate the hash of
> >> > secret+message+piggyback without knowing the secret.  This is not
> >> > the case for Git: only visible data are hashed. The type of attacks
> >> > Git has to worry about is very different from the length extension
> >> > attacks, and it is highly unlikely that that weakness of SHA-256
> >> > leads to, say, a collision attack.
> >>
> >> What do the experts think of SHA512/256, which completely removes the
> >> concerns over length extension attack? (which I'd argue is better than
> >> sweeping them under the carpet)
> >
> > I don't think it's sweeping them under the carpet. Git does not use the
> > hash as a MAC, so length extension attacks aren't a thing (and even if
> > we later wanted to use the same algorithm as a MAC, the HMAC
> > construction is a well-studied technique for dealing with it).

I really tried to drive that point home, as it had been made very clear to
me that the length extension attack is something that Git need not concern
itself with.

The length extension attack *only* comes into play when there are secrets
that are hashed. In that case, one would not want others to be able to
produce a valid hash *without* knowing the secrets. And SHA-256 allows one to
"reconstruct" the internal state (which is the hash value) in order to
continue at any point, i.e. if the hash for secret+message is known, it is
easy to calculate the hash for secret+message+addition, without knowing
the secret at all.

That is exactly *not* the case with Git. In Git, what we want to hash is
known in its entirety. Even if the hash value were not identical to the
internal state, that state would be easy enough to reconstruct, because
*there are no secrets*.
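
To illustrate: everything Git feeds into the hash is public. A blob, for
example, is hashed as a known header plus its complete content (a
sketch; the helper name is mine):

    import hashlib

    def blob_id(content: bytes) -> str:
        # Git hashes "blob <size>\0" followed by the full content;
        # nothing here is secret, so there is nothing to "extend".
        return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

    print(blob_id(b"hello\n"))  # same as `git hash-object --stdin`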

So please understand that even the direction that the length extension
attack takes is completely different than the direction any attack would
have to take that weakens SHA-256 for Git's purposes. As far as Git's
usage is concerned, SHA-256 has no known weaknesses.

It is *really, really, really* important to understand this before going
on to suggest another hash function such as SHA-512/256 (i.e. SHA-512
truncated to 256 bits), based only on that perceived weakness of SHA-256.
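
(A nit, since it keeps coming up: SHA-512/256 is not *literally* SHA-512
truncated; it runs the SHA-512 compression function with distinct
initial values and then truncates the state to 256 bits. The truncation
is what defeats length extension, while the distinct IVs merely
domain-separate it from SHA-512. A sketch, assuming a Python whose
OpenSSL-backed hashlib provides the sha512_256 algorithm:)

    import hashlib

    data = b"abc"
    truncated = hashlib.sha512(data).digest()[:32]       # plain truncation
    official = hashlib.new("sha512_256", data).digest()  # FIPS 180-4 variant
    assert truncated != official  # different IVs, different digests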

> > That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
> > platforms. I don't know if that will change with the advent of
> > hardware instructions oriented towards SHA-256.
> 
> Quoting my own
> cacbzzx7jra2niwt9wsgaxnzs+gws8htugzwm8nay1gs87o8...@mail.gmail.com sent
> ~2 weeks ago to the list:
> 
> On Fri, Jun 2, 2017 at 7:54 PM, Jonathan Nieder wrote:
> [...]
> > 4. When choosing a hash function, people may argue about performance.
> >It would be useful to run some benchmarks for git (running
> >the test suite, t/perf tests, etc) using a variety of hash
> >functions as input to such a discussion.
> 
> To the extent that such benchmarks matter, it seems prudent to heavily
> weigh them in favor of whatever seems to be likely to be the more
> common hash function going forward, since those are likely to get
> faster through future hardware acceleration.
> 
> E.g. Intel announced Goldmont last year, which according to one SHA-1
> implementation's benchmarks improved SHA-1 from 9.5 cycles per byte to
> 2.7 cpb[1]. They only have acceleration for SHA-1 and SHA-256[2].
> 
> 1. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385
> 
> 2. https://en.wikipedia.org/wiki/Goldmont
> 
> Maybe someone else knows of better numbers / benchmarks, but such a
> reduction in cpb likely makes it faster than SHA-512.

Very, very likely faster than SHA-512.

I'd like to stress explicitly that the Intel SHA extensions do *not* cover
SHA-512:

https://en.wikipedia.org/wiki/Intel_SHA_extensions

In other words, once those extensions become commonplace, SHA-256 will be
faster than SHA-512, hands down.

Ciao,
Dscho

Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Junio C Hamano
Brandon Williams writes:

>> It would make a whole lot of sense to make that knob not Boolean,
>> but to specify which hash function is in use.
>
> 100% agree on this point.  I believe the current plan is to have the
> hashing function used for a repository be a repository format extension
> which would be a value (most likely a string like 'sha1', 'sha256',
> 'blake2', etc) stored in a repository's .git/config.  This way, upon
> startup git will die or ignore a repository which uses a hashing
> function which it does not recognize or was not compiled to handle.
>
> I hope (and expect) that the end product of this transition is a nice,
> clean hashing API and interface with sufficient abstractions such that
> if I wanted to switch to a different hashing function I would just need
> to implement the interface with the new hashing function and ensure that
> 'verify_repository_format' allows the new function.

Yup.  I thought that part had already been agreed upon, but it is a
good thing that somebody is writing it down (perhaps "again", if not
"for the first time").

Thanks.



Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Jonathan Nieder
Hi Dscho,

Johannes Schindelin wrote:

> From what I read, pretty much everybody who participated in the discussion
> was aware that the essential question is: performance vs security.

I don't completely agree with this framing.  The essential question is:
how to get the right security properties without abysmal performance.

> It turns out that we can have essentially both.
>
> SHA-256 is most likely the best-studied hash function we currently know
[... etc ...]

Thanks for a thoughtful restart to the discussion.  This is much more
concrete than your previous objections about process, and that is very
helpful.

In the interest of transparency: here are my current questions for
cryptographers to whom I have forwarded this thread.  Several of these
questions involve predictions or opinions, so in my ideal world we'd
want multiple, well reasoned answers to them.  Please feel free to
forward them to appropriate people or add more.

 1. Now it sounds like SHA-512/256 is the safest choice (see also Mike
Hommey's response to Dscho's message).  Please poke holes in my
understanding.

 2. Would you be willing to weigh in publicly on the mailing list? I
think that would be the most straightforward way to move this
forward (and it would give you a chance to ask relevant questions,
etc).  Feel free to contact me privately if you have any questions
about how this particular mailing list works.

 3. On the speed side, Dscho states "SHA-256 will be faster than BLAKE
(and even than BLAKE2) once the Intel and AMD CPUs with hardware
support for SHA-256 become common."  Do you agree?

 4. On the security side, Dscho states "to compete in the SHA-3
contest, BLAKE added complexity so that it would be roughly on par
with its competitors.  To allow for faster execution in software,
this complexity was *removed* from BLAKE to create BLAKE2, making
it weaker than SHA-256."  Putting aside the historical questions,
do you agree with this "weaker than" claim?

 5. On the security side, Dscho states, "The type of attacks Git has to
worry about is very different from the length extension attacks,
and it is highly unlikely that that weakness of SHA-256 leads to,
say, a collision attack", and Jeff King states, "Git does not use
the hash as a MAC, so length extension attacks aren't a thing (and
even if we later wanted to use the same algorithm as a MAC, the
HMAC construction is a well-studied technique for dealing with
it)."  Is this correct in spirit?  Is SHA-256 equally strong to
SHA-512/256 for Git's purposes, or are the increased bits of
internal state (or other differences) relevant?  How would you
compare the two functions' properties?

 6. On the speed side, Jeff King states "That said, SHA-512 is
typically a little faster than SHA-256 on 64-bit platforms. I
don't know if that will change with the advent of hardware
instructions oriented towards SHA-256."  Thoughts?

 7. If the answer to (2) is "no", do I have permission to quote or
paraphrase your replies that were given here?

Thanks, sincerely,
Jonathan


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Brandon Williams
On 06/15, Johannes Schindelin wrote:
> Hi,
> 
> I thought it better to revive this old thread rather than start a new
> thread, so as to automatically reach everybody who chimed in originally.
> 
> On Mon, 6 Mar 2017, Brandon Williams wrote:
> 
> > On 03/06, brian m. carlson wrote:
> >
> > > On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> > >
> > > > Btw, I do think the particular choice of hash should still be on the
> > > > table. sha-256 may be the obvious first choice, but there are
> > > > definitely a few reasons to consider alternatives, especially if
> > > > it's a complete switch-over like this.
> > > > 
> > > > One is large-file behavior - a parallel (or tree) mode could improve
> > > > on that noticeably. BLAKE2 does have special support for that, for
> > > > example. And SHA-256 does have known attacks compared to SHA-3-256
> > > > or BLAKE2 - whether that is due to age or due to more effort, I
> > > > can't really judge. But if we're switching away from SHA1 due to
> > > > known attacks, it does feel like we should be careful.
> > > 
> > > I agree with Linus on this.  SHA-256 is the slowest option, and it's
> > > the one with the most advanced cryptanalysis.  SHA-3-256 is faster on
> > > 64-bit machines (which, as we've seen on the list, is the overwhelming
> > > majority of machines using Git), and even BLAKE2b-256 is stronger.
> > > 
> > > Doing this all over again in another couple years should also be a
> > > non-goal.
> > 
> > I agree that when we decide to move to a new algorithm that we should
> > select one which we plan on using for as long as possible (much longer
> > than a couple years).  While writing the document we simply used
> > "sha256" because it was more tangible and easier to reference.
> 
> The SHA-1 transition *requires* a knob telling Git that the current
> repository uses a hash function different from SHA-1.
> 
> It would make *a whole lot of sense* to make that knob *not* Boolean,
> but to specify *which* hash function is in use.

100% agree on this point.  I believe the current plan is to have the
hashing function used for a repository be a repository format extension
which would be a value (most likely a string like 'sha1', 'sha256',
'blake2', etc) stored in a repository's .git/config.  This way, upon
startup git will die or ignore a repository which uses a hashing
function which it does not recognize or was not compiled to handle.
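
Purely as illustration (hypothetical syntax, not a settled design), such
a knob could look like:

    [core]
            repositoryformatversion = 1
    [extensions]
            objectformat = sha256

An unaware or unwilling git would then refuse to touch the repository
rather than silently corrupt it.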

I hope (and expect) that the end product of this transition is a nice,
clean hashing API and interface with sufficient abstractions such that
if I wanted to switch to a different hashing function I would just need
to implement the interface with the new hashing function and ensure that
'verify_repository_format' allows the new function.
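
In spirit, something of this shape (a Python sketch; all names are
illustrative, not actual Git internals):

    import hashlib

    # registry keyed by the repository-format-extension value
    HASH_ALGOS = {
        "sha1":   hashlib.sha1,
        "sha256": hashlib.sha256,
    }

    def verify_repository_format(name):
        # die on a hash function we don't recognize or weren't built with
        if name not in HASH_ALGOS:
            raise SystemExit("fatal: unknown object format '%s'" % name)
        return HASH_ALGOS[name]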

> 
> That way, it will be easier to switch another time when it becomes
> necessary.
> 
> And it will also make it easier for interested parties to use a different
> hash function in their infrastructure if they want.
> 
> And it lifts part of that burden that we have to consider *very carefully*
> which function to pick. We still should be more careful than in 2005,
> when Git was born, which, incidentally, is when the first attacks on
> SHA-1 became known, of course. We were just lucky for almost 12 years.
> 
> Now, with Dunning-Kruger in mind, I feel that my degree in mathematics
> equips me with *just enough* competence to know just how little *even I*
> know about cryptography.
> 
> The smart thing to do, hence, was to get involved in this discussion and
> act as Lt Tawney Madison between us Git developers and experts in
> cryptography.
> 
> It just so happens that I work at a company with access to excellent
> cryptographers, and as we own the largest Git repository on the planet, we
> have a vested interest in ensuring Git's continued success.
> 
> After a couple of conversations with a couple of experts who I cannot
> thank enough for their time and patience, let alone their knowledge about
> this matter, it would appear that we may not have had a complete enough
> picture yet to even start to make the decision on the hash function to
> use.
> 

-- 
Brandon Williams


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Ævar Arnfjörð Bjarmason

On Thu, Jun 15 2017, Jeff King jotted:

> On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
>
>> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
>> > Footnote *1*: SHA-256, like all hash functions whose output is essentially
>> > the entire internal state, is susceptible to a so-called "length
>> > extension attack", where the hash of a secret+message can be used to
>> > generate the hash of secret+message+piggyback without knowing the secret.
>> > This is not the case for Git: only visible data are hashed. The type of
>> > attacks Git has to worry about is very different from the length extension
>> > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
>> > say, a collision attack.
>>
>> What do the experts think of SHA512/256, which completely removes the
>> concerns over length extension attack? (which I'd argue is better than
>> sweeping them under the carpet)
>
> I don't think it's sweeping them under the carpet. Git does not use the
> hash as a MAC, so length extension attacks aren't a thing (and even if
> we later wanted to use the same algorithm as a MAC, the HMAC
> construction is a well-studied technique for dealing with it).
>
> That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
> platforms. I don't know if that will change with the advent of hardware
> instructions oriented towards SHA-256.

Quoting my own
cacbzzx7jra2niwt9wsgaxnzs+gws8htugzwm8nay1gs87o8...@mail.gmail.com sent
~2 weeks ago to the list:

On Fri, Jun 2, 2017 at 7:54 PM, Jonathan Nieder wrote:
[...]
> 4. When choosing a hash function, people may argue about performance.
>It would be useful to run some benchmarks for git (running
>the test suite, t/perf tests, etc) using a variety of hash
>functions as input to such a discussion.

To the extent that such benchmarks matter, it seems prudent to heavily
weigh them in favor of whatever seems to be likely to be the more
common hash function going forward, since those are likely to get
faster through future hardware acceleration.

E.g. Intel announced Goldmont last year, which according to one SHA-1
implementation's benchmarks improved SHA-1 from 9.5 cycles per byte to
2.7 cpb[1]. They only have acceleration for SHA-1 and SHA-256[2].

1. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385

2. https://en.wikipedia.org/wiki/Goldmont

Maybe someone else knows of better numbers / benchmarks, but such a
reduction in cpb likely makes it faster than SHA-512.


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Jeff King
On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:

> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> > Footnote *1*: SHA-256, like all hash functions whose output is essentially
> > the entire internal state, is susceptible to a so-called "length
> > extension attack", where the hash of a secret+message can be used to
> > generate the hash of secret+message+piggyback without knowing the secret.
> > This is not the case for Git: only visible data are hashed. The type of
> > attacks Git has to worry about is very different from the length extension
> > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> > say, a collision attack.
> 
> What do the experts think of SHA512/256, which completely removes the
> concerns over length extension attack? (which I'd argue is better than
> sweeping them under the carpet)

I don't think it's sweeping them under the carpet. Git does not use the
hash as a MAC, so length extension attacks aren't a thing (and even if
we later wanted to use the same algorithm as a MAC, the HMAC
construction is a well-studied technique for dealing with it).
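
(For reference, the HMAC construction is a one-liner with common crypto
libraries; a sketch with a made-up key:)

    import hashlib, hmac

    tag = hmac.new(b"made-up-key", b"message", hashlib.sha256).hexdigest()

The nested keyed hashing is exactly what stops an attacker from
extending a keyed digest.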

That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
platforms. I don't know if that will change with the advent of hardware
instructions oriented towards SHA-256.

-Peff


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Mike Hommey
On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> Footnote *1*: SHA-256, like all hash functions whose output is essentially
> the entire internal state, is susceptible to a so-called "length
> extension attack", where the hash of a secret+message can be used to
> generate the hash of secret+message+piggyback without knowing the secret.
> This is not the case for Git: only visible data are hashed. The type of
> attacks Git has to worry about is very different from the length extension
> attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> say, a collision attack.

What do the experts think of SHA512/256, which completely removes the
concerns over length extension attack? (which I'd argue is better than
sweeping them under the carpet)

Mike


Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Johannes Schindelin
Hi,

I thought it better to revive this old thread rather than start a new
thread, so as to automatically reach everybody who chimed in originally.

On Mon, 6 Mar 2017, Brandon Williams wrote:

> On 03/06, brian m. carlson wrote:
>
> > On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> >
> > > Btw, I do think the particular choice of hash should still be on the
> > > table. sha-256 may be the obvious first choice, but there are
> > > definitely a few reasons to consider alternatives, especially if
> > > it's a complete switch-over like this.
> > > 
> > > One is large-file behavior - a parallel (or tree) mode could improve
> > > on that noticeably. BLAKE2 does have special support for that, for
> > > example. And SHA-256 does have known attacks compared to SHA-3-256
> > > or BLAKE2 - whether that is due to age or due to more effort, I
> > > can't really judge. But if we're switching away from SHA1 due to
> > > known attacks, it does feel like we should be careful.
> > 
> > I agree with Linus on this.  SHA-256 is the slowest option, and it's
> > the one with the most advanced cryptanalysis.  SHA-3-256 is faster on
> > 64-bit machines (which, as we've seen on the list, is the overwhelming
> > majority of machines using Git), and even BLAKE2b-256 is stronger.
> > 
> > Doing this all over again in another couple years should also be a
> > non-goal.
> 
> I agree that when we decide to move to a new algorithm that we should
> select one which we plan on using for as long as possible (much longer
> than a couple years).  While writing the document we simply used
> "sha256" because it was more tangible and easier to reference.

The SHA-1 transition *requires* a knob telling Git that the current
repository uses a hash function different from SHA-1.

It would make *a whole lot of sense* to make that knob *not* Boolean,
but to specify *which* hash function is in use.

That way, it will be easier to switch another time when it becomes
necessary.

And it will also make it easier for interested parties to use a different
hash function in their infrastructure if they want.

And it lifts part of that burden that we have to consider *very carefully*
which function to pick. We still should be more careful than in 2005,
when Git was born, which, incidentally, is when the first attacks on
SHA-1 became known, of course. We were just lucky for almost 12 years.

Now, with Dunning-Kruger in mind, I feel that my degree in mathematics
equips me with *just enough* competence to know just how little *even I*
know about cryptography.

The smart thing to do, hence, was to get involved in this discussion and
act as Lt Tawney Madison between us Git developers and experts in
cryptography.

It just so happens that I work at a company with access to excellent
cryptographers, and as we own the largest Git repository on the planet, we
have a vested interest in ensuring Git's continued success.

After a couple of conversations with a couple of experts who I cannot
thank enough for their time and patience, let alone their knowledge about
this matter, it would appear that we may not have had a complete enough
picture yet to even start to make the decision on the hash function to
use.

From what I read, pretty much everybody who participated in the discussion
was aware that the essential question is: performance vs security.

It turns out that we can have essentially both.

SHA-256 is most likely the best-studied hash function we currently know
about (*maybe* SHA3-256 has been studied slightly more, but only
slightly). All the experts in the field banged on it with multiple sticks
and other weapons. And so far, they only found one weakness that does not
even apply to Git's usage [*1*]. For cryptography experts, this is the
ultimate measure of security: if something has been attacked that
intensely, by that many experts, for that long, with that little effect,
it is the best we got at the time.

And since SHA-256 has become the standard, and more importantly: since
SHA-256 was explicitly designed to allow for relatively inexpensive
hardware acceleration, this is what we will soon have: hardware support in
the form of, say, special CPU instructions. (That is what I meant by: we
can have performance *and* security.)

This is a rather important point to stress, by the way: BLAKE's design is
apparently *not* friendly to CPU instruction implementations. Meaning that
SHA-256 will be faster than BLAKE (and even than BLAKE2) once the Intel
and AMD CPUs with hardware support for SHA-256 become common.

I also heard something really worrisome about BLAKE2 that makes me want to
stay away from it (in addition to the difficulty it poses for hardware
acceleration): to compete in the SHA-3 contest, BLAKE added complexity so
that it would be roughly on par with its competitors. To allow for faster
execution in software, this complexity was *removed* from BLAKE to create
BLAKE2, making it weaker than SHA-256.

Another important point to 

Re: RFC: Another proposed hash function transition plan

2017-03-17 Thread Johannes Schindelin
Hi Kostis,

On Mon, 13 Mar 2017, ankostis wrote:

> On 13 March 2017 at 18:48, Jonathan Nieder  wrote:
> >
> > The Keccak Team wrote:
> >
> > > We have read your transition plan to move away from SHA-1 and
> > > noticed your intent to use SHA3-256 as the new hash function in the
> > > new Git repository format and protocol. Although this is a valid
> > > choice, we think that the new SHA-3 standard proposes alternatives
> > > that may also be interesting for your use cases.  As designers of
> > > the Keccak function family, we thought we could jump in the mail
> > > thread and present these alternatives.
> >
> > I indeed had some reservations about SHA3-256's performance.  The main
> > hash function we had in mind to compare against is blake2bp-256.  This
> > overview of other functions to compare against should end up being
> > very helpful.
> 
> What if some of us need this extra difficulty, and don't mind the
> performance tax, because we need to refer to hashes 10 or 30 years from
> now, or even in the Post Quantum era?

If you need this extra difficulty, and if this extra difficulty would
imply a huge penalty for everybody else, it is safe to assume that that
extra difficulty would need to be an extra switch, off by default.

It simply shows that we put too much of a burden on SHA-1: we used it for
three separate purposes: to verify data integrity, to allow addressing
objects by their own content, and for signing entire commit histories
cryptographically (more as an afterthought, as I see it: the Linux project
provides the context where you never fetch from any untrusted source,
therefore cryptographically secure signatures are not quite as important
as the trust between maintainer and lieutenants).

We *will* have to separate those concerns, and maybe even switch to
different algorithms for the different concerns. There are much better
algorithms for validating data integrity, for example, including error
correction (which SHA-1 never wanted to do anyway).

In your case, I could imagine that you would simply require verifiable
cryptographic signatures (.asc files) to be committed together with the
documents; it would be much harder to find a collision where those
signatures still match (or a double collision where the forged document's
signature would collide with the non-forged document's signature, in
addition to the two documents colliding).

Another idea would be to use Jonathan Nieder's proposed transition plan
and simply extend it. That transition plan details how the objects would
be hashed with two algorithms locally and how to maintain a bidirectional
mapping between the two. You could simply piggyback on that code and
provide patches that allow for a third, configurable algorithm, and that
algorithm's hashes would simply be added to the commit objects and fsck
would then know to verify those, too. That would be an opt-in feature, of
course, so that only those who need the extra long term security have to
pay the price of a substantially slower hashing.

What we cannot do is to pick a super slow hash algorithm just to cater to
the use case where legal documents are managed, punishing everybody else
for using Git in the intended way: to manage source code.

Ciao,
Johannes


Re: RFC: Another proposed hash function transition plan

2017-03-13 Thread ankostis
On 13 March 2017 at 18:48, Jonathan Nieder  wrote:
>
> Hi,
>
> The Keccak Team wrote:
>
> > We have read your transition plan to move away from SHA-1 and noticed
> > your intent to use SHA3-256 as the new hash function in the new Git
> > repository format and protocol. Although this is a valid choice, we
> > think that the new SHA-3 standard proposes alternatives that may also be
> > interesting for your use cases.  As designers of the Keccak function
> > family, we thought we could jump in the mail thread and present these
> > alternatives.
>
> I indeed had some reservations about SHA3-256's performance.  The main
> hash function we had in mind to compare against is blake2bp-256.  This
> overview of other functions to compare against should end up being
> very helpful.

What if some of us need this extra difficulty, and don't mind
the performance tax,
because we need to refer to hashes 10 or 30 years from now,
or even in the Post Quantum era?

Thanks,
  Kostis


Re: RFC: Another proposed hash function transition plan

2017-03-13 Thread Jonathan Nieder
Hi,

The Keccak Team wrote:

> We have read your transition plan to move away from SHA-1 and noticed
> your intent to use SHA3-256 as the new hash function in the new Git
> repository format and protocol. Although this is a valid choice, we
> think that the new SHA-3 standard proposes alternatives that may also be
> interesting for your use cases.  As designers of the Keccak function
> family, we thought we could jump in the mail thread and present these
> alternatives.

I indeed had some reservations about SHA3-256's performance.  The main
hash function we had in mind to compare against is blake2bp-256.  This
overview of other functions to compare against should end up being
very helpful.

Thanks for this.  When I have more questions (which I most likely
will) I'll keep you posted.

Sincerely,
Jonathan


Re: RFC: Another proposed hash function transition plan

2017-03-13 Thread The Keccak Team
Hello,

We have read your transition plan to move away from SHA-1 and noticed
your intent to use SHA3-256 as the new hash function in the new Git
repository format and protocol. Although this is a valid choice, we
think that the new SHA-3 standard proposes alternatives that may also be
interesting for your use cases.  As designers of the Keccak function
family, we thought we could jump in the mail thread and present these
alternatives.


SHA3-256, standardized in FIPS 202 [1], is a fixed-length hash function
that provides the same interface and security level as SHA-256 (FIPS
180-4). SHA3-256's primary goal is to be drop-in compatible with the
previous standard, and to allow a fast transition for applications that
would already use SHA-256.

Since your application did not use SHA-256, you are free to choose one
of the alternatives listed below.


* SHAKE128

  SHAKE128, defined in FIPS 202, is an eXtendable-Output Function (XOF)
  that generates digests of any size. In your case, you would use
  SHAKE128 the same way you would use SHA3-256, just truncating the
  output at 256 bits. In that case, SHAKE128 provides a security level
  of 128 bits against all generic attacks, including collisions,
  preimages, etc. We think this security level is appropriate for your
  application since this is the maximum you can get with 256-bit tags in
  the case of collision attacks, and this level is beyond computation
  reach for any adversary in the foreseeable future.

  The immediate benefit of using SHAKE128 versus SHA3-256 is a
  performance gain of roughly 20%, both for SW and HW implementations.
  On Intel Core i5-6500, SHAKE128 throughput is 430MiB/s.
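
  For instance, with Python's hashlib (which ships shake_128; the
  hexdigest method of the XOFs takes the desired output length in
  bytes), truncation at 256 bits is just:

      import hashlib

      # 32 bytes = 256 bits of SHAKE128 output
      name = hashlib.shake_128(b"some object content").hexdigest(32)
      print(name)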


* ParallelHash128

  ParallelHash128 (PH128), defined in NIST Special Publication 800-185
  (SP800-185, SHA-3 Derived Functions [2]), is a XOF implementing a tree
  hash mode on top of SHAKE128 (in fact cSHAKE128) to provide higher
  performance for large-file hashing. The tree mode is designed to
  exploit any available parallelism on the CPU, either through vector
  instructions or availability of multiple cores. Note that the chosen
  level of parallelism does not impact the final result, which improves
  interoperability.

  PH128 offers the same security level and interface as SHAKE128. So
  likewise, you just truncate the output at 256 bits.

  The net advantage of using PH128 over SHAKE128 is a huge performance
  boost when hashing big files.  The advantage depends of course on the
  number of cores used for hashing and their architecture. On an Intel
  Core i5-6500 (Skylake), with a single-core, PH128 is faster than
  SHAKE128 by a factor 3 and than SHA-1 by a factor 1.5 over long
  messages, with a throughput of 1320MiB/s.


* KangarooTwelve

  KangarooTwelve (K12) [3] is a very fast parallel and secure XOF we
  defined for applications that require higher performance than the FIPS
  202 and SP800-185 functions provide, while retaining the same
  flexibility and basis of security.

  K12 is very similar to PH128. It uses the same cryptographic primitive
  (Keccak-p, defined in FIPS 202), the same sponge construction, a
  similar tree hashing mode, and targets the same generic security level
  (128 bits). The main differences are the number of rounds for the
  inner permutation, which is reduced to 12, and the tree mode
  parameters, which are optimized for both small and long messages.

  Again, the benefit of using K12 over PH128 is performance. K12 is
  twice as fast as SHAKE128 for short messages, i.e. 820MiB/s on Intel
  Core i5-6500, and twice as fast as PH128 over long messages, i.e.
  2500MiB/s on the same platform.


If performance is not your primary concern, we suggest using SHAKE128
as the default hash function, and optionally use ParallelHash128 for
hashing big files. Both functions offer a considerable security margin
and are standardized algorithms. In the longer term, given HW
acceleration, SHAKE128 alone would easily outperform SHA-1 thanks to its
design.

If, however, you value performance first, or if you would like to promote
adoption of the new repository format by offering higher performance,
then KangarooTwelve is the right candidate. On modern CPUs, K12 offers
performance equal to SHA-1 for small messages and outperforms it by a
factor of 3 for long messages.  Regarding security, although K12 of
course offers a smaller security margin than the other alternatives, it
inherits the security assurance built up for Keccak and the FIPS 202
functions.  As of today, the best practical attack broke 6 rounds of
Keccak-p, with 2^50 computation effort. The 12 rounds of K12 thus offer
a comfortable security margin [4].


Lately, we made a presentation at FOSDEM covering the latest development
over the Keccak family [5].  You can find reference and optimized
implementations of the algorithms listed above in the Keccak Code
Package [6]. Also, if you have questions, don't hesitate to contact us.


Kind regards,
The Keccak Team

Links
 [1]   FIPS 202,
   

Re: RFC: Another proposed hash function transition plan

2017-03-08 Thread Johannes Schindelin
Hi Ian,

On Wed, 8 Mar 2017, Ian Jackson wrote:

> Linus Torvalds writes ("Re: RFC: Another proposed hash function transition 
> plan"):
> > Of course, having written that, I now realize how it would cause
> > problems for the usual shit-for-brains case-insensitive filesystems.
> > So I guess base64 encoding doesn't work well for that reason.
> 
> AFAIAA object names occur in publicly-visible filenames only in notes
> tree objects, which are manipulated by git internally and do not
> necessarily need to appear in the filesystem.
> 
> The filenames in .git/objects/ can be in whatever encoding we like, so
> are not an obstacle.

Given that the idea was to encode the new hash in base64 or base85, we
*are* talking about an encoding. In that respect, yes, it can be whatever
encoding we like, and Linus just made a good point (with unnecessary foul
language) of explaining why base64/base85 is not that encoding.

Ciao,
Johannes


Re: RFC: Another proposed hash function transition plan

2017-03-08 Thread Johannes Schindelin
Hi Ian,

On Wed, 8 Mar 2017, Ian Jackson wrote:

> Few people use uppercase in ref names because of the case-insensitive
> filesystem problem;

Not true.

Ciao,
Johannes


Re: RFC: Another proposed hash function transition plan

2017-03-08 Thread Ian Jackson
Linus Torvalds writes ("Re: RFC: Another proposed hash function transition 
plan"):
> Also, since 256 isn't evenly divisible by 6, and because you'd want
> some way to explicitly disambiguate the new hashes, the rule *could* be
> that the ASCII representation of a new hash is the base64 encoding of
> the 258-bit value that has "10" prepended to it as padding.
> 
> That way the first character of the hash would be guaranteed to not be
> a hex digit, because it would be in the range [g-v] (indexes 32..47).

We should arrange for this to be an uppercase, not a lowercase,
letter, for the reasons I explained in my own proposal.  To summarise:
It would be undesirable to further increase the overlap between object
names and ref names.  Few people use uppercase in ref names because of
the case-insensitive filesystem problem; so object names starting with
uppercase ascii are distinct from most object names.

> Of course, having written that, I now realize how it would cause
> problems for the usual shit-for-brains case-insensitive filesystems.
> So I guess base64 encoding doesn't work well for that reason.

AFAIAA object names occur in publicly-visible filenames only in notes
tree objects, which are manipulated by git internally and do not
necessarily need to appear in the filesystem.

The filenames in .git/objects/ can be in whatever encoding we like, so
are not an obstacle.

Ian.

-- 
Ian Jackson <ijack...@chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.


Re: RFC: Another proposed hash function transition plan

2017-03-07 Thread Ian Jackson
Jonathan Nieder writes ("RFC: Another proposed hash function transition plan"):
> This past week we came up with this idea for what a transition to a new
> hash function for Git would look like.  I'd be interested in your
> thoughts (especially if you can make them as comments on the document,
> which makes it easier to address them and update the document).

Thanks for this.

This is a reasonable plan.  It corresponds to approaches (2) and (B)
of my survey mail from the other day.  Ie, two parallel homogeneous
hash trees, rather than a unified but heterogeneous hash tree, with
old vs new object names distinguished by length.

I still prefer my proposal with the mixed hash tree, mostly because
the handling of signatures here is very awkward, and because my
proposal does not involve altering object ids stored other than in the
git object graph (eg CI system databases, etc.)

One thing you've missed, I think, is notes: notes have to be dealt
with in a more complicated way.  Do you intend to rewrite the tree
objects for notes commits so that the notes are annotations for the
new names for the annotated objects ?  And if so, when ?

Also I think you need to specify how abbreviated object names are
interpreted.

Regards,
Ian.


Re: RFC: Another proposed hash function transition plan

2017-03-07 Thread Linus Torvalds
On Tue, Mar 7, 2017 at 10:57 AM, Ian Jackson wrote:
>
> Also I think you need to specify how abbreviated object names are
> interpreted.

One option might be to not use hex for the new hash, but base64 encoding.

That would make the full size ASCII hash encoding length roughly
similar (43 base64 characters rather than 40), which would offset some
of the new costs (longer filenames in the loose format, for example).

Also, since 256 isn't evenly divisible by 6, and because you'd want
some way to explicitly disambiguate the new hashes, the rule *could* be
that the ASCII representation of a new hash is the base64 encoding of
the 258-bit value that has "10" prepended to it as padding.

That way the first character of the hash would be guaranteed to not be
a hex digit, because it would be in the range [g-v] (indexes 32..47).
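
A toy sketch of that encoding in Python (hypothetical, just to show the
arithmetic: 258 bits split evenly into 43 six-bit base64 digits, and the
leading "10" pins the first digit into indexes 32..47):

    import hashlib

    B64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "abcdefghijklmnopqrstuvwxyz0123456789+/")

    def encode_name(digest32: bytes) -> str:
        # Prepend the two padding bits "10" to the 256-bit digest.
        v = (0b10 << 256) | int.from_bytes(digest32, "big")
        # Peel off 43 six-bit groups, most significant first.
        return "".join(B64[(v >> s) & 0x3F] for s in range(252, -6, -6))

    print(encode_name(hashlib.sha256(b"hello").digest()))  # first char in [g-v]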

Of course, the downside is that base64 encoded hashes can also end up
looking very much like real words, and now case would matter too.

The "use base64 with a "10" two-bit padding prepended" also means that
the natural loose format radix format would remain the first 2
characters of the hash, but due to the first character containing the
padding, it would be a fan-out of 2**10 rather than 2**12.

Of course, having written that, I now realize how it would cause
problems for the usual shit-for-brains case-insensitive filesystems.
So I guess base64 encoding doesn't work well for that reason.

Linus


Re: RFC: Another proposed hash function transition plan

2017-03-07 Thread Jeff King
On Mon, Mar 06, 2017 at 10:39:49AM -0800, Jonathan Tan wrote:

> The "nohash" thing was in the hope of requiring only one signature to sign
> all the hashes (in all the functions) that the user wants, while preserving
> round-tripping ability.

Thanks, this explained it very well.

I understand the tradeoff now, though I am still of the opinion that
simplicity is probably a more important goal.

In practice I'd imagine that anybody doing commit-signing would just
sign the more-secure hash, and people doing tag releases would probably
do a dual-sign to be verifiable by both old and new clients. Those are
infrequent enough that the extra computation probably doesn't matter.
But that's just my gut feeling.

-Peff


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Mike Hommey
On Mon, Mar 06, 2017 at 03:40:30PM -0800, Jonathan Nieder wrote:
> David Lang wrote:
> 
> >> Translation table
> >> ~~~~~~~~~~~~~~~~~
> >> A fast bidirectional mapping between sha1-names and sha256-names of
> >> all local objects in the repository is kept on disk. The exact format
> >> of that mapping is to be determined.
> >>
> >> All operations that make new objects (e.g., "git commit") add the new
> >> objects to the translation table.
> >
> > This seems like a rather nontrivial thing to design. It will need to
> > hold millions of mappings, and be quickly searchable from either
> > direction (sha1->new and new->sha1) while still being fairly fast to
> > insert new records into.
> 
> I am currently thinking of using LevelDB, since it has the advantages of
> being simple, already existing, and having already been ported to Java
> (allowing JGit to read and write the same format).
> 
> If that doesn't work, we'd try some other key-value store like Samba's
> tdb or Kyoto Cabinet.

FWIW, I'm using notes-like data to store mercurial->git mappings in
git-cinnabar, (ab)using the commit type in tree items. It's fast enough.

Mike


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Jonathan Nieder
David Lang wrote:

>> Translation table
>> ~~~~~~~~~~~~~~~~~
>> A fast bidirectional mapping between sha1-names and sha256-names of
>> all local objects in the repository is kept on disk. The exact format
>> of that mapping is to be determined.
>>
>> All operations that make new objects (e.g., "git commit") add the new
>> objects to the translation table.
>
> This seems like a rather nontrivial thing to design. It will need to
> hold millions of mappings, and be quickly searchable from either
> direction (sha1->new and new->sha1) while still being fairly fast to
> insert new records into.

I am currently thinking of using LevelDB, since it has the advantages of
being simple, already existing, and having already been ported to Java
(allowing JGit to read and write the same format).

If that doesn't work, we'd try some other key-value store like Samba's
tdb or Kyoto Cabinet.
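
Whatever the store ends up being, the shape of the mapping is simple; a
minimal sketch of the bidirectional lookup with direction prefixes in a
single key-value namespace ("db" stands in for LevelDB/tdb/etc.; a plain
dict works for illustration):

    def record_mapping(db, sha1: bytes, sha256: bytes) -> None:
        db[b"1:" + sha1] = sha256    # sha1   -> sha256
        db[b"2:" + sha256] = sha1    # sha256 -> sha1

    def sha256_for(db, sha1: bytes) -> bytes:
        return db[b"1:" + sha1]

    def sha1_for(db, sha256: bytes) -> bytes:
        return db[b"2:" + sha256]

    db = {}
    record_mapping(db, b"\x01" * 20, b"\x02" * 32)
    assert sha1_for(db, b"\x02" * 32) == b"\x01" * 20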

Jonathan


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Junio C Hamano
Linus Torvalds  writes:

> So *if* the new object format uses a git header line like
>
> "blob  \0"
>
> then it would inherently contain that mapping from 256-bit hash to the
> SHA1, but it would actually also protect against attacks on the new
> hash.

This is easy for blobs as you only need to hash twice.  I am not
sure if you can do the same for trees, though.  For that <sha1> to
be useful, the hash needs to be over the tree contents whose
references are expressed in <sha1>, which in turn would mean...

... ah, you would read these <sha1> off of the object header in the
new world and you do not need to expand the whole thing.  OK, I see
how it could work.

> In fact, in particular for objects with internal format that
> differs between the two hashing models (ie trees and commits which to
> some degree are higher-value targets), it would make attacks really
> quite complicated, I suspect.
>
> And you wouldn't need those "hash" or "nohash" things at all. The old
> SHA1 would simply always be there, and cheap to look up (ie you
> wouldn't have to unpack the whole object).


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Brandon Williams
On 03/06, Linus Torvalds wrote:
> On Mon, Mar 6, 2017 at 10:39 AM, Jonathan Tan wrote:
> >
> > I think "nohash" can be explained in 2 points:
> 
> I do think that that was my least favorite part of the suggestion. Not
> just "nohash", but all the special "hash" lines too.
> 
> I would honestly hope that the design should not be about "other
> hashes". If you plan your expectations around the new hash being
> broken, something is wrong to begin with.
> 
> I do wonder if things wouldn't be simpler if the new format just
> included the SHA1 object name in the new object. Put it in the
> "header" line of the object, so that every time you look up an object,
> you just _see_ the SHA1 of that object. You can even think of it as an
> additional protection.
> 
> Btw, the multi-collision attack referenced earlier does _not_ work for
> an iterated hash that has a bigger internal state than the final hash.
> Which is actually a real argument against sha-256: the internal state
> of sha-256 is 256 bits, so if an attack can find collisions due to
> some weakness, you really can then generate exponential collisions by
> chaining a linear collision search together.
> 
> But for sha3-256 or blake2, the internal hash state is larger than the
> final hash, so now you need to generate collisions not in the 256
> bits, but in the much larger search space of the internal hash space
> if you want to generate those exponential collisions.
> 
> So *if* the new object format uses a git header line like
> 
> "blob  \0"
> 
> then it would inherently contain that mapping from 256-bit hash to the
> SHA1, but it would actually also protect against attacks on the new
> hash. In fact, in particular for objects with internal format that
> differs between the two hashing models (ie trees and commits which to
> some degree are higher-value targets), it would make attacks really
> quite complicated, I suspect.
> 
> And you wouldn't need those "hash" or "nohash" things at all. The old
> SHA1 would simply always be there, and cheap to look up (ie you
> wouldn't have to unpack the whole object).
> 
> Hmm?

I'll agree that the "hash" "nohash" bit isn't my favorite and is really
only there to address the signing of tags/commits in this new non-sha1
world.  I'm inclined to take a closer look at Jeff's suggestion which
simply has a signature for the hash that the signer cares about.

I don't know if keeping around the SHA1 for every object buys you all
that much.  It would add an additional layer of protection but you would
also need to compute the SHA1 for each object indefinitely (assuming you
include the SHA1 in new objects and not just converted objects).  The
hope would be that at some point you could not worry about SHA1 at all.
That may be difficult for projects with long history with commit msgs
which reference SHA1's of other commits (if you wanted to look up the
referenced commit, for example), but projects started in the new
non-sha1 world shouldn't have to ever compute a sha1.

Also, during this transition phase you would still need to maintain the
sha1<->sha256 translation table to make looking up objects by their sha1
name in a sha256 repo fast.  Otherwise I think it would take a
non-trivial amount of time to search a sha256 repo for a sha1 name.  So
if you do include the sha1 in the new object format then you would end
up with some duplicate information, which isn't the end of the world.

-- 
Brandon Williams


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Linus Torvalds
On Mon, Mar 6, 2017 at 10:39 AM, Jonathan Tan  wrote:
>
> I think "nohash" can be explained in 2 points:

I do think that that was my least favorite part of the suggestion. Not
just "nohash", but all the special "hash" lines too.

I would honestly hope that the design should not be about "other
hashes". If you plan your expectations around the new hash being
broken, something is wrong to begin with.

I do wonder if things wouldn't be simpler if the new format just
included the SHA1 object name in the new object. Put it in the
"header" line of the object, so that every time you look up an object,
you just _see_ the SHA1 of that object. You can even think of it as an
additional protection.

Btw, the multi-collision attack referenced earlier does _not_ work for
an iterated hash that has a bigger internal state than the final hash.
Which is actually a real argument against sha-256: the internal state
of sha-256 is 256 bits, so if an attack can find collisions due to
some weakness, you really can then generate exponential collisions by
chaining a linear collision search together.

But for sha3-256 or blake2, the internal hash state is larger than the
final hash, so now you need to generate collisions not in the 256
bits, but in the much larger search space of the internal hash space
if you want to generate those exponential collisions.

So *if* the new object format uses a git header line like

"blob  \0"

then it would inherently contain that mapping from 256-bit hash to the
SHA1, but it would actually also protect against attacks on the new
hash. In fact, in particular for objects with internal format that
differs between the two hashing models (ie trees and commits which to
some degree are higher-value targets), it would make attacks really
quite complicated, I suspect.
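
A sketch of the idea in Python (a hypothetical header layout, not an
actual Git format): the new name is computed over a header that embeds
the object's legacy SHA-1 name, so every object carries its own mapping.

    import hashlib

    def names(kind: bytes, payload: bytes):
        size = str(len(payload)).encode()
        # Legacy name over the classic "type SP size NUL payload" form.
        sha1 = hashlib.sha1(kind + b" " + size + b"\x00" + payload).hexdigest()
        # New name over a header that also carries the SHA-1.
        hdr = kind + b" " + size + b" " + sha1.encode() + b"\x00"
        return sha1, hashlib.sha256(hdr + payload).hexdigest()

    print(names(b"blob", b"hello"))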

And you wouldn't need those "hash" or "nohash" things at all. The old
SHA1 would simply always be there, and cheap to look up (ie you
wouldn't have to unpack the whole object).

Hmm?

   Linus


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Jonathan Tan

On 03/06/2017 12:43 AM, Jeff King wrote:

Overall the basics of the conversion seem sound to me. The "nohash"
things seems more complicated than I think it ought to be, which
probably just means I'm missing something.  I left a few related
comments on the google doc, so I won't repeat them here.


I think "nohash" can be explained in 2 points:
 1. When creating signed objects, "nohash" is almost never written. Just
create the object as usual and add "hash" lines for every other hash
function that you want the signature to cover.
 2. When converting from function A to function B, add "nohash B" if
there were no "hash B" lines in the original object.

The "nohash" thing was in the hope of requiring only one signature to 
sign all the hashes (in all the functions) that the user wants, while 
preserving round-tripping ability.


Maybe some examples would help to address the apparent complexity. These 
examples are the same as those in the document. I'll also show future 
compatibility with a hypothetical NEW hash function, and extend the rule 
about signing/verification to 'sign in the earliest supported hash 
function in ({object's hash function} + {functions in "hash" lines} - 
{function in "nohash" line})'.


Example 1 (existing signed commit)

  sha1              sha256            NEW
  ----              ------            ---
  (no hash lines)   nohash sha256     nohash new
                    hash sha1 ...     hash sha1 ...

This object was probably created in a SHA-1 repository with no knowledge 
that we were going to transition to SHA256 (but there is nothing 
preventing us from creating the middle or right object and then 
translating it to the other functions).


Example 2 (recommended way to sign a commit in a SHA256 repo)

  sha1              sha256            NEW
  ----              ------            ---
  hash sha256 ...   hash sha1 ...     nohash new
                                      hash sha1 ...
                                      hash sha256 ...

This is the recommended way to create a SHA256 object in a SHA256 repo. 
The rule about signing/verification (as stated above) is to sign in 
SHA-1, so when signing or verifying, we convert the object to SHA-1 and 
use that as the payload. Note that the signature covers both the SHA-1 
and SHA256 hashes, and that existing Git implementations can verify the 
signature.


Example 3 (a signer that does not care about SHA-1 anymore)

  sha1              sha256            NEW
  ----              ------            ---
  nohash sha1       (no hash lines)   nohash new
  hash sha256 ...                     hash sha256 ...

If we were to create a SHA256 object without any mentions of SHA-1, the 
rule about signing/verification (as stated above) states that the 
signature payload is the SHA256 object. This means that existing Git 
implementations cannot verify the signature, but we can still round-trip 
to SHA-1 and back without losing any information (as far as I can tell).


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Junio C Hamano
Jeff King  writes:

>> You can use the doc URL
>> 
>>  https://goo.gl/gh2Mzc
>
> I'd encourage anybody following along to follow that link. I almost
> didn't, but there are a ton of comments there (I'm not sure how I feel
> about splitting the discussion off the list, though).

I am sure how I feel about it---we should really discourage it,
unless it is an effort to help polishing an early draft for wider
distribution and discussion.

> I don't think we do this right now, but you can actually find the entry
> (and exit) points of a pack during the index-pack step. Basically:

We have code to do the "entry point" computation in index-pack
already, I think, in 81a04b01 ("index-pack: --clone-bundle option",
2016-03-03).

> I don't think using the "want"s as the entry points is unreasonable,
> though. The server _shouldn't_ generally be sending us other cruft.

That's true.


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Brandon Williams
On 03/06, brian m. carlson wrote:
> On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> > On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder  wrote:
> > >
> > > This document is still in flux but I thought it best to send it out
> > > early to start getting feedback.
> > 
> > This actually looks very reasonable if you can implement it cleanly
> > enough. In many ways the "convert entirely to a new 256-bit hash" is
> > the cleanest model, and interoperability was at least my personal
> > concern. Maybe your model solves it (devil in the details), in which
> > case I really like it.
> 
> If you think you can do it, I'm all for it.
> 
> > Btw, I do think the particular choice of hash should still be on the
> > table. sha-256 may be the obvious first choice, but there are
> > definitely a few reasons to consider alternatives, especially if it's
> > a complete switch-over like this.
> > 
> > One is large-file behavior - a parallel (or tree) mode could improve
> > on that noticeably. BLAKE2 does have special support for that, for
> > example. And SHA-256 does have known attacks compared to SHA-3-256 or
> > BLAKE2 - whether that is due to age or due to more effort, I can't
> > really judge. But if we're switching away from SHA1 due to known
> > attacks, it does feel like we should be careful.
> 
> I agree with Linus on this.  SHA-256 is the slowest option, and it's the
> one with the most advanced cryptanalysis.  SHA-3-256 is faster on 64-bit
> machines (which, as we've seen on the list, is the overwhelming majority
> of machines using Git), and even BLAKE2b-256 is stronger.
> 
> Doing this all over again in another couple years should also be a
> non-goal.

I agree that when we decide to move to a new algorithm that we should
select one which we plan on using for as long as possible (much longer
than a couple years).  While writing the document we simply used
"sha256" because it was more tangible and easier to reference.

> -- 
> brian m. carlson / brian with sandals: Houston, Texas, US
> +1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
> OpenPGP: https://keybase.io/bk2204



-- 
Brandon Williams


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Jeff King
On Mon, Mar 06, 2017 at 10:29:33AM +0100, ankostis wrote:

> On 5 March 2017 at 12:02, David Lang  wrote:
> >> Translation table
> >> ~~~~~~~~~~~~~~~~~
> >> A fast bidirectional mapping between sha1-names and sha256-names of
> >> all local objects in the repository is kept on disk. The exact format
> >> of that mapping is to be determined.
> >>
> >> All operations that make new objects (e.g., "git commit") add the new
> >> objects to the translation table.
> >
> >
> > This seems like a rather nontrivial thing to design. It will need to hold
> > millions of mappings, and be quickly searchable from either direction
> > (sha1->new and new->sha1) while still being fairly fast to insert new records
> > into.
> >
> > For Linux, just the list of hashes recording the commits is going to be in
> > the millions, while the list of hashes of individual files for all those
> > commits is going to be substantially larger.
> 
> Apologies if it is a stupid idea, but could we avoid the mappings-table
> just by
> hard-linking to the same object from both (or more) hashes?
> So instead of creating a text-db format, just use the filesystem.

No, for a few reasons:

  1. Most of these objects will not be in the filesystem at all, but
 rather in a packfile.

  2. It's not just a different hash over the same bytes. The sha256-name
 is taken over the sha256-content (which refers to other objects
 using sha256). So they really are different objects. You probably
 wouldn't keep the sha1 version around separately, but rather
 generate it on the fly during a push to a sha1 server.

  3. You really need to be able to take a sha256 name and convert it to
 a sha1 and vice versa. Hardlinks don't help with that, because they
     only point in one direction. That gets you to the same _content_,
     but not the other name (and I guess this is where your "look up the
     name and then compute the other digest" comes in, but that's
     probably too expensive to be workable).

I do think updating the mapping could potentially be deferred until
interacting with a sha1 server. But because it needs to be generated in
reverse-topological order, it's conceptually easier to do it one object
at a time.

-Peff


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Jeff King
On Fri, Mar 03, 2017 at 05:12:51PM -0800, Jonathan Nieder wrote:

> This past week we came up with this idea for what a transition to a new
> hash function for Git would look like.  I'd be interested in your
> thoughts (especially if you can make them as comments on the document,
> which makes it easier to address them and update the document).

Overall it's an interesting idea. I thought at first that you were
suggesting servers do on-the-fly conversion, but after a more careful
reading that isn't the case. And I don't think that would work, because
the conversion is expensive.

So this pushes the conversion cost onto the clients who decide to move
to SHA-256. That may be a problem for sites which have a lot of clients
(like CI hosts). But I guess they would just stick with SHA-1 as long as
possible, until the upstream repo switches (and that _is_ a per-repo
flag day, because the upstream host isn't going to convert back to SHA-1
on the fly to serve the old clients).

> You can use the doc URL
> 
>  https://goo.gl/gh2Mzc

I'd encourage anybody following along to follow that link. I almost
didn't, but there are a ton of comments there (I'm not sure how I feel
about splitting the discussion off the list, though).

> Goals
> -
> 1. The transition to SHA256 can be done one local repository at a time.
>a. Requiring no action by any other party.
>b. A SHA256 repository can communicate with SHA-1 Git servers and
>   clients (push/fetch).
>c. Users can use SHA-1 and SHA256 identifiers for objects
>   interchangeably.
>d. New signed objects make use of a stronger hash function than
>   SHA-1 for their security guarantees.
> 2. Allow a complete transition away from SHA-1.
>a. Local metadata for SHA-1 compatibility can be dropped in a
>   repository if compatibility with SHA-1 is no longer needed.

I suspect we'll never get away from keeping the mapping table. You'll
need at least the sha1->sha256 table if you want to look up names found
in historic commit messages, mailing list posts, etc.

And you'll need the sha256->sha1 table if you want to verify the gpg
signatures on old tags and commits. That might be something people are
willing to drop, though.

> After negotiation, the server sends a packfile containing the
> requested objects. We convert the packfile to SHA-256 format using the
> following steps:
> 
> 1. index-pack: inflate each object in the packfile and compute its
>SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
>objects the client has locally. These objects can be looked up using
>the translation table and their sha1-content read as described above
>to resolve the deltas.
> 2. topological sort: starting at the "want"s from the negotiation
>phase, walk through objects in the pack and emit a list of them in
>topologically sorted order. (This list only contains objects
>reachable from the "wants". If the pack from the server contained
>additional extraneous objects, then they will be discarded.)
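
(To make the quoted step 2 concrete, a toy post-order walk that yields
each reachable object after everything it references; refs(name) stands
in for parsing an object for the names it mentions:

    def topo_order(wants, refs):
        seen, out = set(), []
        def visit(name):
            if name in seen:
                return
            seen.add(name)
            for child in refs(name):
                visit(child)
            out.append(name)       # emitted only after its references
        for w in wants:
            visit(w)
        return out

A real implementation would want an iterative walk to cope with deep
histories, but the ordering is the point here.)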

I don't think we do this right now, but you can actually find the entry
(and exit) points of a pack during the index-pack step. Basically:

  1. Keep a hashmap of objects mentioned in the pack.

  2. When we process an object's content (i.e., compute its hash), also
 parse it for any object references. Add entries in the hashmap for
 any object mentioned this way. Mark the entry for the object we
 processed with a "HAVE" bit, and mark any referenced object with a
 "REF" bit.

  3. After processing all objects, anything with a "HAVE" but no "REF"
 is an entry point to the pack (i.e., something that we should have
 asked for with a want). Anything with a "REF" but not a "HAVE" is
 an exit point (i.e., an object that we are expected to already have
 in our repo).

 (I've thought about this before because we could possibly shortcut
 the connectivity check using the exit points. It's complicated by
 the fact that we don't assume the transitive presence of objects
 unless they are reachable).
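
In code, the bookkeeping from those three steps is tiny (a sketch over a
pack represented as {object_name: [referenced names]}):

    HAVE, REF = 1, 2

    def entry_and_exit_points(pack):
        bits = {}
        for name, refs in pack.items():
            bits[name] = bits.get(name, 0) | HAVE
            for r in refs:
                bits[r] = bits.get(r, 0) | REF
        entries = {n for n, b in bits.items() if b == HAVE}  # HAVE, no REF
        exits = {n for n, b in bits.items() if b == REF}     # REF, no HAVE
        return entries, exits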

I don't think using the "want"s as the entry points is unreasonable,
though. The server _shouldn't_ generally be sending us other cruft.

I do wonder if you might be able to omit the extra object-graph walk
from your step 2, if you could assign "depths" to each object during
step 1 instead of HAVE/REF bits. The trouble, of course, is that you're
not visiting the nodes in the right order (so given two trees, you're
not sure if one might eventually be a child of the other; how do you
assign their depths?). I have a feeling there's a proof that it's
impossible, but I might just not be clever enough.


Overall the basics of the conversion seem sound to me. The "nohash"
things seems more complicated than I think it ought to be, which
probably just means I'm missing something.  I left a few related
comments on the google doc, so I won't repeat them here.

-Peff


Re: RFC: Another proposed hash function transition plan

2017-03-05 Thread brian m. carlson
On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder  wrote:
> >
> > This document is still in flux but I thought it best to send it out
> > early to start getting feedback.
> 
> This actually looks very reasonable if you can implement it cleanly
> enough. In many ways the "convert entirely to a new 256-bit hash" is
> the cleanest model, and interoperability was at least my personal
> concern. Maybe your model solves it (devil in the details), in which
> case I really like it.

If you think you can do it, I'm all for it.

> Btw, I do think the particular choice of hash should still be on the
> table. sha-256 may be the obvious first choice, but there are
> definitely a few reasons to consider alternatives, especially if it's
> a complete switch-over like this.
> 
> One is large-file behavior - a parallel (or tree) mode could improve
> on that noticeably. BLAKE2 does have special support for that, for
> example. And SHA-256 does have known attacks compared to SHA-3-256 or
> BLAKE2 - whether that is due to age or due to more effort, I can't
> really judge. But if we're switching away from SHA1 due to known
> attacks, it does feel like we should be careful.

I agree with Linus on this.  SHA-256 is the slowest option, and it's the
one with the most advanced cryptanalysis.  SHA-3-256 is faster on 64-bit
machines (which, as we've seen on the list, is the overwhelming majority
of machines using Git), and even BLAKE2b-256 is stronger.

Doing this all over again in another couple years should also be a
non-goal.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204




Re: RFC: Another proposed hash function transition plan

2017-03-05 Thread David Lang

Translation table
~~~~~~~~~~~~~~~~~
A fast bidirectional mapping between sha1-names and sha256-names of
all local objects in the repository is kept on disk. The exact format
of that mapping is to be determined.

All operations that make new objects (e.g., "git commit") add the new
objects to the translation table.


This seems like a rather nontrivial thing to design. It will need to hold 
millions of mappings, and be quickly searchable from either direction (sha1->new 
and new->sha1) while still being fairly fast to insert new records into.


For Linux, just the list of hashes recording the commits is going to be in the 
millions, while the list of hashes of individual files for all those commits is 
going to be substantially larger.


David Lang


Re: RFC: Another proposed hash function transition plan

2017-03-04 Thread Linus Torvalds
On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder  wrote:
>
> This document is still in flux but I thought it best to send it out
> early to start getting feedback.

This actually looks very reasonable if you can implement it cleanly
enough. In many ways the "convert entirely to a new 256-bit hash" is
the cleanest model, and interoperability was at least my personal
concern. Maybe your model solves it (devil in the details), in which
case I really like it.

I do think that if you end up essentially converting the objects
without really having any true backwards compatibility at the object
layer (just the translation code), you should seriously look at doing
some other changes at the same time. Like not using zlib compression,
it really is very slow.

Btw, I do think the particular choice of hash should still be on the
table. sha-256 may be the obvious first choice, but there are
definitely a few reasons to consider alternatives, especially if it's
a complete switch-over like this.

One is large-file behavior - a parallel (or tree) mode could improve
on that noticeably. BLAKE2 does have special support for that, for
example. And SHA-256 does have known attacks compared to SHA-3-256 or
BLAKE2 - whether that is due to age or due to more effort, I can't
really judge. But if we're switching away from SHA1 due to known
attacks, it does feel like we should be careful.

Linus


RFC: Another proposed hash function transition plan

2017-03-03 Thread Jonathan Nieder
Hi,

This past week we came up with this idea for what a transition to a new
hash function for Git would look like.  I'd be interested in your
thoughts (especially if you can make them as comments on the document,
which makes it easier to address them and update the document).

This document is still in flux but I thought it best to send it out
early to start getting feedback.

We tried to incorporate some thoughts from the thread
http://public-inbox.org/git/20170223164306.spg2avxzukkgg...@kitenet.net
but it is a little long so it is easy to imagine we've missed
some things already discussed there.

You can use the doc URL

 https://goo.gl/gh2Mzc

to view the latest version and comment.

Thoughts welcome, as always.

Git hash function transition

Status: Draft
Last Updated: 2017-03-03

Objective
-
Migrate Git from SHA-1 to a stronger hash function.

Background
--
The Git version control system can be thought of as a content
addressable filesystem. It uses the SHA-1 hash function to name
content. For example, files, trees, and commits are referred to by hash
values, unlike in other traditional version control systems where files
or versions are referred to via sequential numbers. The use of a hash
function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
* Lookup of objects is fast.
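
For instance, today's name of a 5-byte blob containing "hello" is the
SHA-1 of a small type/size header plus the raw content (sketch in
Python):

    import hashlib

    print(hashlib.sha1(b"blob 5\x00hello").hexdigest())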

Using a cryptographically secure hash function brings additional advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to reliably
  address stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. https://shattered.io demonstrated a practical SHA-1 hash
collision. As a result, SHA-1 cannot be considered cryptographically
secure any more. This impacts the communication of hash values because
we cannot trust that a given hash value represents the known good
version of content that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions that are believed to
be cryptographically secure are equally suitable.

Goals
-
1. The transition to SHA256 can be done one local repository at a time.
   a. Requiring no action by any other party.
   b. A SHA256 repository can communicate with SHA-1 Git servers and
  clients (push/fetch).
   c. Users can use SHA-1 and SHA256 identifiers for objects
  interchangeably.
   d. New signed objects make use of a stronger hash function than
  SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be dropped in a
  repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a. The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
-
1. Add SHA256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in git's formats and
   protocols.
5. Shallow clones and fetches into a SHA256 repository. (This will
   change when we add SHA256 support to Git protocol.)
6. Skip fetching some submodules of a project into a SHA256
   repository. (This also depends on SHA256 support in Git protocol.)

Overview

We introduce a new repository format extension `sha256`. Repositories
with this extension enabled use SHA256 instead of SHA-1 to name their
objects. This affects both object names and object content --- both
the names of objects and all references to other objects within an
object are switched to the new hash function.

sha256 repositories cannot be read by older versions of Git.

Alongside the packfile, a sha256 repository stores a bidirectional
mapping between sha256 and sha1 object names. The mapping is generated
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their sha1 or sha256 names
interchangeably.

"git cat-file" and "git hash-object" gain options to display a sha256
object in its sha1 form and write a sha256 object given its sha1 form.
This requires all objects referenced by that object to be present in
the object database so that they can be named using the appropriate
name (using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
sha256 form and record the mapping in the bidirectional mapping table
(see below for