Re: [PATCH 18/41] index-pack: abstract away hash function constant

2018-04-27 Thread Duy Nguyen
On Fri, Apr 27, 2018 at 11:08 PM, brian m. carlson
 wrote:
> On Thu, Apr 26, 2018 at 05:46:28PM +0200, Duy Nguyen wrote:
>> On Wed, Apr 25, 2018 at 8:49 PM, Martin Ågren  wrote:
>> > Once that is accomplished, I sort of suspect that this code will want to
>> > be updated to not always blindly use the_hash_algo, but to always work
>> > with SHA-1 sizes. Or rather, this would turn into more generic code to
>> > handle both "v2 with SHA-1" and "v3 with some hash function(s)". This
>> > commit might be a good first step in that direction.
>>
>> I also have an uneasy feeling when things this close to on-disk file
>> format get hash-agnostic treatment. I think we would need to start
>> adding new file formats soon, from bottom up with simple things like
>> loose object files (cat-file and hash-object should be enough to test
>> blobs...), then moving up to pack files and more. This is when we can
>> really decide where to use the new hash and whether we should keep
>> some hashes as sha-1.
>
> I agree that this is work which needs to be done soon.  There are
> basically a couple of pieces we need to handle NewHash:
>
> * Remove the dependencies on SHA-1 as much as possible.
> * Get the tests to pass with a different hash (almost done for 160-bit
>   hash; in progress for 256-bit hashes).

This step sounds good on paper but realistically could be a nightmare for you.

I tried to implement a simple cat-file/hash-object combination with my
imaginary newhash, which sounded straightforward to me since you have
done a lot of heavylifting. To my surprise I hit a lot more problems.
My point is, when I concentrate on just a few simple cases like this,
I have a smaller scope to work with and could quickly identify
problems. When you work on the scale of the test suite, it's really
hard to know where the problem is (and you don't even know what areas
are newhash-safe).

Anyway my cat-file/hash-object modification could be found here. I
probably will polish and send out a few good patches from it.

https://github.com/pclouds/git/commits/new-hash
-- 
Duy


Re: [PATCH 18/41] index-pack: abstract away hash function constant

2018-04-27 Thread brian m. carlson
On Thu, Apr 26, 2018 at 05:46:28PM +0200, Duy Nguyen wrote:
> On Wed, Apr 25, 2018 at 8:49 PM, Martin Ågren  wrote:
> > Once that is accomplished, I sort of suspect that this code will want to
> > be updated to not always blindly use the_hash_algo, but to always work
> > with SHA-1 sizes. Or rather, this would turn into more generic code to
> > handle both "v2 with SHA-1" and "v3 with some hash function(s)". This
> > commit might be a good first step in that direction.
> 
> I also have an uneasy feeling when things this close to on-disk file
> format get hash-agnostic treatment. I think we would need to start
> adding new file formats soon, from bottom up with simple things like
> loose object files (cat-file and hash-object should be enough to test
> blobs...), then moving up to pack files and more. This is when we can
> really decide where to use the new hash and whether we should keep
> some hashes as sha-1.

I agree that this is work which needs to be done soon.  There are
basically a couple of pieces we need to handle NewHash:

* Remove the dependencies on SHA-1 as much as possible.
* Get the tests to pass with a different hash (almost done for 160-bit
  hash; in progress for 256-bit hashes).
* Write pack code.
* Write loose object index code.
* Write read-as-SHA-1 code.
* Force the codebase to always use SHA-1 when dealing with fetch/push.
* Distinguish between code which needs to use compatObjectFormat and
  code which needs to use objectFormat.
* Decide on NewHash.

I'm working on the top two bullet points right now.  Others are welcome
to pick up other pieces, or I'll get to them eventually.

As much as I'm dreading having the bikeshedding discussion over what
we're going to pick for NewHash, some of these pieces require knowing
what algorithm it will be.  For example, we have some tests which either
need to be completely rewritten or have a translation table written for
them (think the ones that use colliding short names).  In order for
those tests to have the translation table written, we need to be able to
compute colliding values.  I'm annotating these with prerequisites, but
there are quite a few tests which are skipped.

I expect writing the pack, loose object index, and read-as-SHA-1 code is
going to require having some code for NewHash or stand-in present in
order for it to compile and be tested.  It's possible that others could
come up with more imaginative solutions that don't require that, but I
have my doubts.

> For trailing hashes for example, there's no need to move to a new hash
> which only costs us more cycles. We just use it as a fancy checksum to
> avoid bit flips. But then my assumption about cost may be completely
> wrong without experimenting.

I would argue that consistency is helpful.  Also, do we really want
people to be able to (eventually) create colliding packs that contain
different data?  That doesn't seem like a good idea.

But also, some of the candidates we're considering for NewHash are
actually faster than SHA-1.  So for performance reasons alone, it might
be useful to adopt a consistent scheme.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [PATCH 18/41] index-pack: abstract away hash function constant

2018-04-26 Thread Duy Nguyen
On Wed, Apr 25, 2018 at 8:49 PM, Martin Ågren  wrote:
>> I agree that pack v2 is not going to have anything but SHA-1.  However,
>> writing all the code such that it's algorithm agnostic means that we can
>> do testing of new algorithms by wholesale replacing the algorithm with a
>> new one, which simplifies things considerably.
>
> Ok. I do sort of wonder if a "successful" test run after globally
> substituting Hash-Foo for SHA-1 (regardless of whether the size changes
> or not) hints at a problem. That is, nowhere do we test that this code
> uses 20-byte SHA-1s, regardless of what other hash functions are
> available and configured. Of course, until soon, that did not really
> have to be tested since there was only one hash function available to
> choose from. As for identifying all the places that matter ... no idea.
>
> Of course I can see how this helps get things to a point where Git does
> not crash and burn because the hash has a different size, and where the
> test suite doesn't spew failures because the initial chaining value of
> "SHA-1" is changed.
>
> Once that is accomplished, I sort of suspect that this code will want to
> be updated to not always blindly use the_hash_algo, but to always work
> with SHA-1 sizes. Or rather, this would turn into more generic code to
> handle both "v2 with SHA-1" and "v3 with some hash function(s)". This
> commit might be a good first step in that direction.

I also have an uneasy feeling when things this close to on-disk file
format get hash-agnostic treatment. I think we would need to start
adding new file formats soon, from bottom up with simple things like
loose object files (cat-file and hash-object should be enough to test
blobs...), then moving up to pack files and more. This is when we can
really decide where to use the new hash and whether we should keep
some hashes as sha-1.

For trailing hashes for example, there's no need to move to a new hash
which only costs us more cycles. We just use it as a fancy checksum to
avoid bit flips. But then my assumption about cost may be completely
wrong without experimenting.

> Long rambling short, yeah, I see your point.

So yeah. It may be ok to move everything to "new hash" now. But we
need a closer look soon.
-- 
Duy


Re: [PATCH 18/41] index-pack: abstract away hash function constant

2018-04-25 Thread Martin Ågren
On 25 April 2018 at 01:51, brian m. carlson
 wrote:
> On Tue, Apr 24, 2018 at 11:50:16AM +0200, Martin Ågren wrote:
>> On 24 April 2018 at 01:39, brian m. carlson
>>  wrote:
>> > The code for reading certain pack v2 offsets had a hard-coded 5
>> > representing the number of uint32_t words that we needed to skip over.
>> > Specify this value in terms of a value from the_hash_algo.
>> >
>> > Signed-off-by: brian m. carlson 
>> > ---
>> >  builtin/index-pack.c | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>> > index d81473e722..c1f94a7da6 100644
>> > --- a/builtin/index-pack.c
>> > +++ b/builtin/index-pack.c
>> > @@ -1543,12 +1543,13 @@ static void read_v2_anomalous_offsets(struct 
>> > packed_git *p,
>> >  {
>> > const uint32_t *idx1, *idx2;
>> > uint32_t i;
>> > +   const uint32_t hashwords = the_hash_algo->rawsz / sizeof(uint32_t);
>>
>> Should we round up? Or just what should we do if a length is not
>> divisible by 4? (I am not aware of any such hash functions, but one
>> could exist for all I know.) Another question is whether such an
>> index-pack v2 will ever contain non-SHA-1 oids to begin with. I can't
>> find anything suggesting that it could, but this is unfamiliar code to
>> me.
>
> I opted not to simply because I know that our current hash is 20 bytes
> and the new one will be 32, and I know those are both divisible by 4.  I
> feel confident that any future hash we choose will also be divisible by
> 4, and the code is going to be complicated if it isn't.
>
> I agree that pack v2 is not going to have anything but SHA-1.  However,
> writing all the code such that it's algorithm agnostic means that we can
> do testing of new algorithms by wholesale replacing the algorithm with a
> new one, which simplifies things considerably.

Ok. I do sort of wonder if a "successful" test run after globally
substituting Hash-Foo for SHA-1 (regardless of whether the size changes
or not) hints at a problem. That is, nowhere do we test that this code
uses 20-byte SHA-1s, regardless of what other hash functions are
available and configured. Of course, until soon, that did not really
have to be tested since there was only one hash function available to
choose from. As for identifying all the places that matter ... no idea.

Of course I can see how this helps get things to a point where Git does
not crash and burn because the hash has a different size, and where the
test suite doesn't spew failures because the initial chaining value of
"SHA-1" is changed.

Once that is accomplished, I sort of suspect that this code will want to
be updated to not always blindly use the_hash_algo, but to always work
with SHA-1 sizes. Or rather, this would turn into more generic code to
handle both "v2 with SHA-1" and "v3 with some hash function(s)". This
commit might be a good first step in that direction.

Long rambling short, yeah, I see your point.

Martin


Re: [PATCH 18/41] index-pack: abstract away hash function constant

2018-04-24 Thread brian m. carlson
On Tue, Apr 24, 2018 at 11:50:16AM +0200, Martin Ågren wrote:
> On 24 April 2018 at 01:39, brian m. carlson
>  wrote:
> > The code for reading certain pack v2 offsets had a hard-coded 5
> > representing the number of uint32_t words that we needed to skip over.
> > Specify this value in terms of a value from the_hash_algo.
> >
> > Signed-off-by: brian m. carlson 
> > ---
> >  builtin/index-pack.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> > index d81473e722..c1f94a7da6 100644
> > --- a/builtin/index-pack.c
> > +++ b/builtin/index-pack.c
> > @@ -1543,12 +1543,13 @@ static void read_v2_anomalous_offsets(struct 
> > packed_git *p,
> >  {
> > const uint32_t *idx1, *idx2;
> > uint32_t i;
> > +   const uint32_t hashwords = the_hash_algo->rawsz / sizeof(uint32_t);
> 
> Should we round up? Or just what should we do if a length is not
> divisible by 4? (I am not aware of any such hash functions, but one
> could exist for all I know.) Another question is whether such an
> index-pack v2 will ever contain non-SHA-1 oids to begin with. I can't
> find anything suggesting that it could, but this is unfamiliar code to
> me.

I opted not to simply because I know that our current hash is 20 bytes
and the new one will be 32, and I know those are both divisible by 4.  I
feel confident that any future hash we choose will also be divisible by
4, and the code is going to be complicated if it isn't.

I agree that pack v2 is not going to have anything but SHA-1.  However,
writing all the code such that it's algorithm agnostic means that we can
do testing of new algorithms by wholesale replacing the algorithm with a
new one, which simplifies things considerably.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [PATCH 18/41] index-pack: abstract away hash function constant

2018-04-24 Thread Martin Ågren
On 24 April 2018 at 01:39, brian m. carlson
 wrote:
> The code for reading certain pack v2 offsets had a hard-coded 5
> representing the number of uint32_t words that we needed to skip over.
> Specify this value in terms of a value from the_hash_algo.
>
> Signed-off-by: brian m. carlson 
> ---
>  builtin/index-pack.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index d81473e722..c1f94a7da6 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1543,12 +1543,13 @@ static void read_v2_anomalous_offsets(struct 
> packed_git *p,
>  {
> const uint32_t *idx1, *idx2;
> uint32_t i;
> +   const uint32_t hashwords = the_hash_algo->rawsz / sizeof(uint32_t);

Should we round up? Or just what should we do if a length is not
divisible by 4? (I am not aware of any such hash functions, but one
could exist for all I know.) Another question is whether such an
index-pack v2 will ever contain non-SHA-1 oids to begin with. I can't
find anything suggesting that it could, but this is unfamiliar code to
me.

> /* The address of the 4-byte offset table */
> idx1 = (((const uint32_t *)p->index_data)
> + 2 /* 8-byte header */
> + 256 /* fan out */
> -   + 5 * p->num_objects /* 20-byte SHA-1 table */
> +   + hashwords * p->num_objects /* object ID table */
> + p->num_objects /* CRC32 table */
> );