Re: [PATCH 18/41] index-pack: abstract away hash function constant
On Fri, Apr 27, 2018 at 11:08 PM, brian m. carlson wrote: > On Thu, Apr 26, 2018 at 05:46:28PM +0200, Duy Nguyen wrote: >> On Wed, Apr 25, 2018 at 8:49 PM, Martin Ågren wrote: >> > Once that is accomplished, I sort of suspect that this code will want to >> > be updated to not always blindly use the_hash_algo, but to always work >> > with SHA-1 sizes. Or rather, this would turn into more generic code to >> > handle both "v2 with SHA-1" and "v3 with some hash function(s)". This >> > commit might be a good first step in that direction. >> >> I also have an uneasy feeling when things this close to on-disk file >> format get hash-agnostic treatment. I think we would need to start >> adding new file formats soon, from bottom up with simple things like >> loose object files (cat-file and hash-object should be enough to test >> blobs...), then moving up to pack files and more. This is when we can >> really decide where to use the new hash and whether we should keep >> some hashes as sha-1. > > I agree that this is work which needs to be done soon. There are > basically a couple of pieces we need to handle NewHash: > > * Remove the dependencies on SHA-1 as much as possible. > * Get the tests to pass with a different hash (almost done for 160-bit > hash; in progress for 256-bit hashes). This step sounds good on paper but realistically could be a nightmare for you. I tried to implement a simple cat-file/hash-object combination with my imaginary newhash, which sounded straightforward to me since you have done a lot of heavylifting. To my surprise I hit a lot more problems. My point is, when I concentrate on just a few simple cases like this, I have a smaller scope to work with and could quickly identify problems. When you work on the scale of the test suite, it's really hard to know where the problem is (and you don't even know what areas are newhash-safe). Anyway my cat-file/hash-object modification could be found here. I probably will polish and send out a few good patches from it. https://github.com/pclouds/git/commits/new-hash -- Duy
Re: [PATCH 18/41] index-pack: abstract away hash function constant
On Thu, Apr 26, 2018 at 05:46:28PM +0200, Duy Nguyen wrote: > On Wed, Apr 25, 2018 at 8:49 PM, Martin Ågren wrote: > > Once that is accomplished, I sort of suspect that this code will want to > > be updated to not always blindly use the_hash_algo, but to always work > > with SHA-1 sizes. Or rather, this would turn into more generic code to > > handle both "v2 with SHA-1" and "v3 with some hash function(s)". This > > commit might be a good first step in that direction. > > I also have an uneasy feeling when things this close to on-disk file > format get hash-agnostic treatment. I think we would need to start > adding new file formats soon, from bottom up with simple things like > loose object files (cat-file and hash-object should be enough to test > blobs...), then moving up to pack files and more. This is when we can > really decide where to use the new hash and whether we should keep > some hashes as sha-1. I agree that this is work which needs to be done soon. There are basically a couple of pieces we need to handle NewHash: * Remove the dependencies on SHA-1 as much as possible. * Get the tests to pass with a different hash (almost done for 160-bit hash; in progress for 256-bit hashes). * Write pack code. * Write loose object index code. * Write read-as-SHA-1 code. * Force the codebase to always use SHA-1 when dealing with fetch/push. * Distinguish between code which needs to use compatObjectFormat and code which needs to use objectFormat. * Decide on NewHash. I'm working on the top two bullet points right now. Others are welcome to pick up other pieces, or I'll get to them eventually. As much as I'm dreading having the bikeshedding discussion over what we're going to pick for NewHash, some of these pieces require knowing what algorithm it will be. For example, we have some tests which either need to be completely rewritten or have a translation table written for them (think the ones that use colliding short names). In order for those tests to have the translation table written, we need to be able to compute colliding values. I'm annotating these with prerequisites, but there are quite a few tests which are skipped. I expect writing the pack, loose object index, and read-as-SHA-1 code is going to require having some code for NewHash or stand-in present in order for it to compile and be tested. It's possible that others could come up with more imaginative solutions that don't require that, but I have my doubts. > For trailing hashes for example, there's no need to move to a new hash > which only costs us more cycles. We just use it as a fancy checksum to > avoid bit flips. But then my assumption about cost may be completely > wrong without experimenting. I would argue that consistency is helpful. Also, do we really want people to be able to (eventually) create colliding packs that contain different data? That doesn't seem like a good idea. But also, some of the candidates we're considering for NewHash are actually faster than SHA-1. So for performance reasons alone, it might be useful to adopt a consistent scheme. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 signature.asc Description: PGP signature
Re: [PATCH 18/41] index-pack: abstract away hash function constant
On Wed, Apr 25, 2018 at 8:49 PM, Martin Ågren wrote: >> I agree that pack v2 is not going to have anything but SHA-1. However, >> writing all the code such that it's algorithm agnostic means that we can >> do testing of new algorithms by wholesale replacing the algorithm with a >> new one, which simplifies things considerably. > > Ok. I do sort of wonder if a "successful" test run after globally > substituting Hash-Foo for SHA-1 (regardless of whether the size changes > or not) hints at a problem. That is, nowhere do we test that this code > uses 20-byte SHA-1s, regardless of what other hash functions are > available and configured. Of course, until soon, that did not really > have to be tested since there was only one hash function available to > choose from. As for identifying all the places that matter ... no idea. > > Of course I can see how this helps get things to a point where Git does > not crash and burn because the hash has a different size, and where the > test suite doesn't spew failures because the initial chaining value of > "SHA-1" is changed. > > Once that is accomplished, I sort of suspect that this code will want to > be updated to not always blindly use the_hash_algo, but to always work > with SHA-1 sizes. Or rather, this would turn into more generic code to > handle both "v2 with SHA-1" and "v3 with some hash function(s)". This > commit might be a good first step in that direction. I also have an uneasy feeling when things this close to on-disk file format get hash-agnostic treatment. I think we would need to start adding new file formats soon, from bottom up with simple things like loose object files (cat-file and hash-object should be enough to test blobs...), then moving up to pack files and more. This is when we can really decide where to use the new hash and whether we should keep some hashes as sha-1. For trailing hashes for example, there's no need to move to a new hash which only costs us more cycles. We just use it as a fancy checksum to avoid bit flips. But then my assumption about cost may be completely wrong without experimenting. > Long rambling short, yeah, I see your point. So yeah. It may be ok to move everything to "new hash" now. But we need a closer look soon. -- Duy
Re: [PATCH 18/41] index-pack: abstract away hash function constant
On 25 April 2018 at 01:51, brian m. carlson wrote: > On Tue, Apr 24, 2018 at 11:50:16AM +0200, Martin Ågren wrote: >> On 24 April 2018 at 01:39, brian m. carlson >> wrote: >> > The code for reading certain pack v2 offsets had a hard-coded 5 >> > representing the number of uint32_t words that we needed to skip over. >> > Specify this value in terms of a value from the_hash_algo. >> > >> > Signed-off-by: brian m. carlson >> > --- >> > builtin/index-pack.c | 3 ++- >> > 1 file changed, 2 insertions(+), 1 deletion(-) >> > >> > diff --git a/builtin/index-pack.c b/builtin/index-pack.c >> > index d81473e722..c1f94a7da6 100644 >> > --- a/builtin/index-pack.c >> > +++ b/builtin/index-pack.c >> > @@ -1543,12 +1543,13 @@ static void read_v2_anomalous_offsets(struct >> > packed_git *p, >> > { >> > const uint32_t *idx1, *idx2; >> > uint32_t i; >> > + const uint32_t hashwords = the_hash_algo->rawsz / sizeof(uint32_t); >> >> Should we round up? Or just what should we do if a length is not >> divisible by 4? (I am not aware of any such hash functions, but one >> could exist for all I know.) Another question is whether such an >> index-pack v2 will ever contain non-SHA-1 oids to begin with. I can't >> find anything suggesting that it could, but this is unfamiliar code to >> me. > > I opted not to simply because I know that our current hash is 20 bytes > and the new one will be 32, and I know those are both divisible by 4. I > feel confident that any future hash we choose will also be divisible by > 4, and the code is going to be complicated if it isn't. > > I agree that pack v2 is not going to have anything but SHA-1. However, > writing all the code such that it's algorithm agnostic means that we can > do testing of new algorithms by wholesale replacing the algorithm with a > new one, which simplifies things considerably. Ok. I do sort of wonder if a "successful" test run after globally substituting Hash-Foo for SHA-1 (regardless of whether the size changes or not) hints at a problem. That is, nowhere do we test that this code uses 20-byte SHA-1s, regardless of what other hash functions are available and configured. Of course, until soon, that did not really have to be tested since there was only one hash function available to choose from. As for identifying all the places that matter ... no idea. Of course I can see how this helps get things to a point where Git does not crash and burn because the hash has a different size, and where the test suite doesn't spew failures because the initial chaining value of "SHA-1" is changed. Once that is accomplished, I sort of suspect that this code will want to be updated to not always blindly use the_hash_algo, but to always work with SHA-1 sizes. Or rather, this would turn into more generic code to handle both "v2 with SHA-1" and "v3 with some hash function(s)". This commit might be a good first step in that direction. Long rambling short, yeah, I see your point. Martin
Re: [PATCH 18/41] index-pack: abstract away hash function constant
On Tue, Apr 24, 2018 at 11:50:16AM +0200, Martin Ågren wrote: > On 24 April 2018 at 01:39, brian m. carlson > wrote: > > The code for reading certain pack v2 offsets had a hard-coded 5 > > representing the number of uint32_t words that we needed to skip over. > > Specify this value in terms of a value from the_hash_algo. > > > > Signed-off-by: brian m. carlson > > --- > > builtin/index-pack.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/builtin/index-pack.c b/builtin/index-pack.c > > index d81473e722..c1f94a7da6 100644 > > --- a/builtin/index-pack.c > > +++ b/builtin/index-pack.c > > @@ -1543,12 +1543,13 @@ static void read_v2_anomalous_offsets(struct > > packed_git *p, > > { > > const uint32_t *idx1, *idx2; > > uint32_t i; > > + const uint32_t hashwords = the_hash_algo->rawsz / sizeof(uint32_t); > > Should we round up? Or just what should we do if a length is not > divisible by 4? (I am not aware of any such hash functions, but one > could exist for all I know.) Another question is whether such an > index-pack v2 will ever contain non-SHA-1 oids to begin with. I can't > find anything suggesting that it could, but this is unfamiliar code to > me. I opted not to simply because I know that our current hash is 20 bytes and the new one will be 32, and I know those are both divisible by 4. I feel confident that any future hash we choose will also be divisible by 4, and the code is going to be complicated if it isn't. I agree that pack v2 is not going to have anything but SHA-1. However, writing all the code such that it's algorithm agnostic means that we can do testing of new algorithms by wholesale replacing the algorithm with a new one, which simplifies things considerably. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 signature.asc Description: PGP signature
Re: [PATCH 18/41] index-pack: abstract away hash function constant
On 24 April 2018 at 01:39, brian m. carlson wrote: > The code for reading certain pack v2 offsets had a hard-coded 5 > representing the number of uint32_t words that we needed to skip over. > Specify this value in terms of a value from the_hash_algo. > > Signed-off-by: brian m. carlson > --- > builtin/index-pack.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/builtin/index-pack.c b/builtin/index-pack.c > index d81473e722..c1f94a7da6 100644 > --- a/builtin/index-pack.c > +++ b/builtin/index-pack.c > @@ -1543,12 +1543,13 @@ static void read_v2_anomalous_offsets(struct > packed_git *p, > { > const uint32_t *idx1, *idx2; > uint32_t i; > + const uint32_t hashwords = the_hash_algo->rawsz / sizeof(uint32_t); Should we round up? Or just what should we do if a length is not divisible by 4? (I am not aware of any such hash functions, but one could exist for all I know.) Another question is whether such an index-pack v2 will ever contain non-SHA-1 oids to begin with. I can't find anything suggesting that it could, but this is unfamiliar code to me. > /* The address of the 4-byte offset table */ > idx1 = (((const uint32_t *)p->index_data) > + 2 /* 8-byte header */ > + 256 /* fan out */ > - + 5 * p->num_objects /* 20-byte SHA-1 table */ > + + hashwords * p->num_objects /* object ID table */ > + p->num_objects /* CRC32 table */ > );