Re: [PATCH v4] technical doc: add a design doc for hash function transition
Jonathan Nieder writes:

> +Alternatives considered
> +---
> +Upgrading everyone working on a particular project on a flag day
> +
> ...
> +Using hash functions in parallel
> +
> ...

Good that we are not doing these ;-)

> +Lazily populated translation table
> +~~
> +Some of the work of building the translation table could be deferred to
> +push time, but that would significantly complicate and slow down pushes.
> +Calculating the sha1-name at object creation time at the same time it is
> +being streamed to disk and having its newhash-name calculated should be
> +an acceptable cost.

And the version described in the body of the document hopefully
would be simpler.  It certainly would be, when SHA-1 content and
NewHash content are the same (i.e. blob).

Thanks.
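A minimal sketch of the blob case mentioned above, where the SHA-1 content and the NewHash content are identical, so both names can be computed in one pass over the data as it is streamed. NewHash had not been chosen in this thread, so SHA-256 is used purely as a stand-in, and the function name `name_blob_both_ways` is made up for illustration.

```python
import hashlib
import io


def name_blob_both_ways(stream, size, chunk_size=65536):
    """Return (sha1_name, newhash_name) for a blob of `size` bytes.

    Both digests are fed the same "blob <size>\\0" header and the same
    data chunks, so the blob is read only once.
    """
    header = b"blob %d\0" % size
    sha1 = hashlib.sha1(header)
    newhash = hashlib.sha256(header)  # assumption: stand-in for NewHash
    remaining = size
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            raise ValueError("stream ended before the declared size")
        sha1.update(chunk)
        newhash.update(chunk)
        remaining -= len(chunk)
    return sha1.hexdigest(), newhash.hexdigest()


data = b"hello\n"
sha1_name, newhash_name = name_blob_both_ways(io.BytesIO(data), len(data))
print(sha1_name)
print(newhash_name)
```

For trees, commits and tags the two contents differ, so the same trick would need the converted content to be produced before the second digest can be fed.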
Re: [PATCH v4] technical doc: add a design doc for hash function transition
On Tue, Oct 03, 2017 at 02:40:26PM +0900, Junio C Hamano wrote:
> Jonathan Nieder writes:
...
> > +Meaning of signatures
> > +~
> > +The signed payload for signed commits and tags does not explicitly
> > +name the hash used to identify objects. If some day Git adopts a new
> > +hash function with the same length as the current SHA-1 (40
> > +hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
> > +intent behind the PGP signed payload in an object signature is
> > +unclear:
> > +
> > +  object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
> > +  type commit
> > +  tag v2.12.0
> > +  tagger Junio C Hamano 1487962205 -0800
> > +
> > +  Git 2.12
> > +
> > +Does this mean Git v2.12.0 is the commit with sha1-name
> > +e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
> > +new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
> > +
> > +Fortunately NewHash and SHA-1 have different lengths. If Git starts
> > +using another hash with the same length to name objects, then it will
> > +need to change the format of signed payloads using that hash to
> > +address this issue.
>
> This is not just signatures, is it? The reference to parent commits
> and its tree in a commit object would also have ambiguity between
> SHA-1 and new-40-digit-hash. And the "no mixed repository" rule
> resolved that for us---isn't that sufficient for the signed tag (or
> commit), too? If such a signed-tag appears in a SHA-1 content of a
> tag, then the "object" reference is made with SHA-1. If the tag is
> in NewHash40 content, "object" reference is made with NewHash40, no?

I do hope we adhere to the "no mixed repository" rule. Or, at least,
"no mixing of hash types". Ambiguity opens cracks for uncertainty to
creep in.

For our case, where we counter-hash the sha1 commits and counter-sign
the sha1-based signatures, we intend to include the relevant
sha1<->newhash lookups in the newhash signature body.

afaict, the git sha1<->newhash table is not cryptographically secured
underneath signatures, and thus can't be used in the verification of
objects. The advantage of our approach is that we can be as explicit
as necessary with "SHA-1 -> SHA-512/256" or "SHA-1 -> SHA3-256" in the
body of the message.

thx,

Jason.
Re: [PATCH v4] technical doc: add a design doc for hash function transition
Jonathan Niederwrites: > +Signed Tags > +~~~ > +We add a new field "gpgsig-newhash" to the tag object format to allow > +signing tags without relying on SHA-1. Its signed payload is the > +newhash-content of the tag with its gpgsig-newhash field and "-BEGIN PGP > +SIGNATURE-" delimited in-body signature removed. > + > +This means tags can be signed > +1. using SHA-1 only, as in existing signed tag objects > +2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body > + signature. > +3. using only NewHash, by only using the gpgsig-newhash field. I have the same issue with signed commit. The signed parts for SHA-1 contents exclude the in-body signature (obviously) and all the headers including gpgsig-newhash that is not known to our old clients are included. The signed parts for NewHash contents exclude the in-body signature and gpgsig-newhash header, but all other headers. I somehow feel that we should just reserve gpgsig-* to prepare for the day when we introduce newhash2 and later and exclude all of them from the computation. Treat the difference between how SHA-1 contents excludes _only_ it knows about and how NewHash contents excludes _all_ possible signatures, just like the differece between where SHA-1 and NewHash contents has the signature. That is, yes, we didn't know better when we designed SHA-1 contents, but now we know better and are correcting the mistakes by moving the signature from in-body tail to a header, and by excluding anything gpgsig-*, not just the known ones. > +Mergetag embedding > +~~ > +The mergetag field in the sha1-content of a commit contains the > +sha1-content of a tag that was merged by that commit. > + > +The mergetag field in the newhash-content of the same commit contains the > +newhash-content of the same tag. OK. We do not have a tool that extracts them and creates a tag object, but if such a tool is invented in the future, it would only have to worry about newhash content, as it would be a local operation. 
Makes sense. > +Submodules > +~~ > +To convert recorded submodule pointers, you need to have the converted > +submodule repository in place. The translation table of the submodule > +can be used to look up the new hash. OK, I earlier commented on a paragraph that I couldn't tell what it was talking about, but this is a lot more understandable. Perhaps the earlier one can be removed? We saw earlier what happens during "fetch". This seems to hint that we would need to do a "recursive" fetch in the bottom-up direction, but without fetching the superproject, you wouldn't know what submodules are needed and from where, so there is a bit of chicken-and-egg problem we need to address, as we further make the design more detailed. > +Loose objects and unreachable objects > +~ > ... > +"git gc --auto" currently waits for there to be 50 packs present > +before combining packfiles. Packing loose objects more aggressively > +may cause the number of pack files to grow too quickly. This can be > +mitigated by using a strategy similar to Martin Fick's exponential > +rolling garbage collection script: > +https://gerrit-review.googlesource.com/c/gerrit/+/35215 Yes, concatenating into the latest pack that still is small may be a reasonable way, as there won't be many good chances to create good deltas anyway until you have blobs and trees at sufficiently numbers of different versions, to do a "quick GC whose only purpose is to keep the number of loose object down". > +To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be > +combined under certain circumstances. If "gc.garbageTtl" is set to > +greater than one day, then packs created within a single calendar day, > +UTC, can be coalesced together. The resulting packfile would have an > +mtime before midnight on that day, so this makes the effective maximum > +ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, > +then we divide the calendar day into intervals one-third of that ttl > +in duration. 
Packs created within the same interval can be coalesced > +together. The resulting packfile would have an mtime before the end of > +the interval, so this makes the effective maximum ttl equal to the > +garbageTtl * 4/3. OK. Is the use of mtime essential, or because packs are "write once and from there access read-only", would a timestamp written somewhere in the header or the trailer of the file, if existed, work equally well? Not a strong objection, but a mild suggestion that not relying on mtime may be a good idea (it will keep an accidental / unintended "touch" from keeping garbage alive longer than you want). > +The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack > +index. More generally, that field indicates where a pack came from: > + > + - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network > + - 2 (PACK_SOURCE_AUTO) for a pack created by a
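
The coalescing intervals for UNREACHABLE_GARBAGE packs quoted above can be worked through numerically. This is only a sketch of the arithmetic as the quoted text describes it: times are integer epoch seconds, the function names are invented, and treating a ttl of exactly one day like the "greater than one day" case is an assumption, since the text does not cover it.

```python
DAY = 86400  # seconds in one UTC calendar day (POSIX time, no leap seconds)


def bucket_width(garbage_ttl):
    """Width of the interval within which garbage packs may be coalesced."""
    if garbage_ttl >= DAY:  # assumption: exactly one day handled like "> 1 day"
        return DAY
    return max(1, garbage_ttl // 3)  # one-third of the ttl, at least 1 second


def coalesce_bucket(created_at, garbage_ttl):
    """Return (start, end) of the interval that contains `created_at`."""
    width = bucket_width(garbage_ttl)
    day_start = created_at - created_at % DAY
    start = day_start + (created_at - day_start) // width * width
    end = min(start + width, day_start + DAY)  # never cross midnight UTC
    return start, end


def effective_max_ttl(garbage_ttl):
    """Longest an object can survive: the ttl plus one full interval."""
    return garbage_ttl + bucket_width(garbage_ttl)


six_hours = 6 * 3600
print(effective_max_ttl(six_hours) / six_hours)   # ttl * 4/3
print(effective_max_ttl(2 * DAY) / DAY)           # ttl + 1 day, in days
```

The extra interval comes from the coalesced pack taking an mtime near the end of its interval, so an object that became garbage at the start of the interval is kept for the ttl plus the interval width.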
Re: [PATCH v4] technical doc: add a design doc for hash function transition
On Fri, Sep 29, 2017 at 10:34:13AM -0700, Jonathan Nieder wrote:
> Junio C Hamano wrote:
> > Jonathan Nieder writes:
...
> > If it is a goal to eventually be able to lose SHA-1 compatibility
> > metadata from the objects, then we might want to remove SHA-1 based
> > signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
> > commit object) from NewHash contents, and instead have them stored
> > in a side "metadata" table, only to be used while converting back.
> > I dunno if that is desirable.
>
> I don't consider that desirable.
>
> A SHA-1 based signature is still of historical interest even if my
> centuries-newer version of Git is not able to verify it.

Agreed, even a signature made by a now exposed and revoked key still
has validity. Especially in a commit or merge. We know it was made
prior to the key being compromised / revoked.

This is assuming that the keyholder can definitively say "Don't trust
signatures from this key after this date/time+", and that the
signature in question is in the git history prior to that cutoff.

Tags are a different animal because they can be added at any time and
aren't directly incorporated into the history.

thx,

Jason.
Re: [PATCH v4] technical doc: add a design doc for hash function transition
Hi Jonathan, On Wed, Sep 27, 2017 at 09:43:21PM -0700, Jonathan Nieder wrote: > This document describes what a transition to a new hash function for > Git would look like. Add it to Documentation/technical/ as the plan > of record so that future changes can be recorded as patches. > > Also-by: Brandon Williams> Also-by: Jonathan Tan > Also-by: Stefan Beller > Signed-off-by: Jonathan Nieder > --- > On Thu, Mar 09, 2017 at 11:14 AM, Shawn Pearce wrote: > > On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder wrote: > > >> Thanks for the kind words on what had quite a few flaws still. Here's > >> a new draft. I think the next version will be a patch against > >> Documentation/technical/. > > > > FWIW, I like this approach. > > Okay, here goes. > > Instead of sharding the loose object translation tables by first byte, > we went for a single table. It simplifies the design and we need to > keep the number of loose objects under control anyway. > > We also included a description of the transition plan and tried to > include a summary of what has been agreed upon so far about the choice > of hash function. > > Thanks to Junio for reviving the discussion and in particular to Dscho > for pushing this forward and making the missing pieces clearer. > > Thoughts of all kinds welcome, as always. > > Documentation/Makefile | 1 + > .../technical/hash-function-transition.txt | 797 > + > 2 files changed, 798 insertions(+) > create mode 100644 Documentation/technical/hash-function-transition.txt > ... > diff --git a/Documentation/technical/hash-function-transition.txt > b/Documentation/technical/hash-function-transition.txt > new file mode 100644 > index 00..417ba491d0 > --- /dev/null > +++ b/Documentation/technical/hash-function-transition.txt > @@ -0,0 +1,797 @@ > +Git hash function transition > + > + > +Objective > +- > +Migrate Git from SHA-1 to a stronger hash function. > + ... 
> +Goals > +- > +Where NewHash is a strong 256-bit hash function to replace SHA-1 (see > +"Selection of a New Hash", below): Could we clarify and say "a strong hash function with 256-bit output"? ... > +Overview > + > +We introduce a new repository format extension. Repositories with this > +extension enabled use NewHash instead of SHA-1 to name their objects. > +This affects both object names and object content --- both the names > +of objects and all references to other objects within an object are > +switched to the new hash function. > + > +NewHash repositories cannot be read by older versions of Git. > + > +Alongside the packfile, a NewHash repository stores a bidirectional > +mapping between NewHash and SHA-1 object names. The mapping is generated > +locally and can be verified using "git fsck". Object lookups use this > +mapping to allow naming objects using either their SHA-1 and NewHash names > +interchangeably. nit: Are we presuming that abbreviated hashes won't collide? Or the user needs to specify which hash type? > +Object format > +~ > +The content as a byte sequence of a tag, commit, or tree object named > +by sha1 and newhash differ because an object named by newhash-name refers to > +other objects by their newhash-names and an object named by sha1-name > +refers to other objects by their sha1-names. > + > +The newhash-content of an object is the same as its sha1-content, except > +that objects referenced by the object are named using their newhash-names > +instead of sha1-names. Because a blob object does not refer to any > +other object, its sha1-content and newhash-content are the same. > + > +The format allows round-trip conversion between newhash-content and > +sha1-content. It would be nice here to explicitly mention deterministic hashing. Meaning that anyone who converts a commit from sha1 to newhash shall get the same newhash. 
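
To make the point about deterministic conversion concrete, here is a rough sketch of rewriting a commit's sha1-content into newhash-content with a translation table. It is a simplification under stated assumptions: only `tree` and `parent` headers are rewritten (gpgsig and mergetag handling is left out), SHA-256 stands in for NewHash when naming the result, and `rewrite_commit_refs` and `object_name` are hypothetical helper names.

```python
import hashlib

REF_HEADERS = (b"tree", b"parent")  # headers whose value is an object name


def rewrite_commit_refs(content, table):
    """Rewrite object names in a commit's headers using `table`.

    `table` maps names in one hash to names in the other (bytes to bytes).
    The commit message after the first blank line is left untouched.
    """
    header, sep, body = content.partition(b"\n\n")
    lines = []
    for line in header.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key in REF_HEADERS:
            line = key + b" " + table[value]
        lines.append(line)
    return b"\n".join(lines) + sep + body


def object_name(kind, content, hash_ctor):
    """Name an object by hashing "<type> <length>\\0" followed by content."""
    return hash_ctor(kind + b" %d\0" % len(content) + content).hexdigest()


# Made-up object names, just to show the shape of the data.
to_new = {b"a" * 40: b"b" * 64, b"c" * 40: b"d" * 64}
to_old = {new: old for old, new in to_new.items()}

sha1_content = (b"tree " + b"a" * 40 + b"\n"
                b"parent " + b"c" * 40 + b"\n"
                b"author A U Thor <author@example.com> 0 +0000\n"
                b"\nmessage\n")
newhash_content = rewrite_commit_refs(sha1_content, to_new)
print(object_name(b"commit", newhash_content, hashlib.sha256))
print(rewrite_commit_refs(newhash_content, to_old) == sha1_content)
```

Because the output depends only on the input bytes and on the referenced objects' own converted names, two people converting the same history would arrive at the same newhash-name, and the inverse table restores the original bytes.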
> + > +Object storage > +~~ > +Loose objects use zlib compression and packed objects use the packed > +format described in Documentation/technical/pack-format.txt, just like > +today. The content that is compressed and stored uses newhash-content > +instead of sha1-content. > + > +Pack index > +~~ > +Pack index (.idx) files use a new v3 format that supports multiple > +hash functions. They have the following format (all integers are in > +network byte order): > + > +- A header appears at the beginning and consists of the following: > + - The 4-byte pack index signature: '\377t0c' > + - 4-byte version number: 3 > + - 4-byte length of the header section, including the signature and > +version number > + - 4-byte number of objects contained in the pack > + - 4-byte number of object formats in this pack index: 2 > + - For each object format: > +- 4-byte format identifier (e.g., 'sha1' for SHA-1) This seems a
Re: [PATCH v4] technical doc: add a design doc for hash function transition
Jonathan Nieder writes:

> +Reading an object's sha1-content
> +
> +The sha1-content of an object can be read by converting all newhash-names
> +its newhash-content references to sha1-names using the translation table.

Sure.

> +Fetch
> +~
> +Fetching from a SHA-1 based server requires translating between SHA-1
> +and NewHash based representations on the fly.
> +
> +SHA-1s named in the ref advertisement that are present on the client
> +can be translated to NewHash and looked up as local objects using the
> +translation table.
> +
> +Negotiation proceeds as today. Any "have"s generated locally are
> +converted to SHA-1 before being sent to the server, and SHA-1s
> +mentioned by the server are converted to NewHash when looking them up
> +locally.

Any of our alternate object stores is by definition a NewHash
repository--otherwise we'd violate the "no mixing" rule.  It may or
may not have the translation table for its objects.  If it no longer
has the translation table (because it migrated to a NewHash-only
world before we did), then we can still use it as our alternate but
we cannot use it for the purpose of common ancestor discovery.

> +After negotiation, the server sends a packfile containing the
> +requested objects.

s/objects.$/& These are all SHA-1 contents./

> +We convert the packfile to NewHash format using
> +the following steps:
> +
> +1. index-pack: inflate each object in the packfile and compute its
> +   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
> +   objects the client has locally. These objects can be looked up
> +   using the translation table and their sha1-content read as
> +   described above to resolve the deltas.

That procedure would give us the object's SHA-1 contents for
ref-delta objects.  For an ofs-delta object, by definition, its base
object should appear in the same packstream, so we should eventually
be able to get to the SHA-1 contents of the delta base, and from
there we can apply the delta to obtain the SHA-1 contents.  For a
non-delta object, we already have its SHA-1 contents in the
packstream.

So we can get SHA-1 names and SHA-1 contents of each and every
object in the packstream in this step.

Are we actually writing out a .pack/.idx pair that is usable in the
SHA-1 world at this stage?  Or are we going to read from something
we keep in-core in the step #3 below?

> +2. topological sort: starting at the "want"s from the negotiation
> +   phase, walk through objects in the pack and emit a list of them,
> +   excluding blobs, in reverse topologically sorted order, with each
> +   object coming later in the list than all objects it references.
> +   (This list only contains objects reachable from the "wants". If the
> +   pack from the server contained additional extraneous objects, then
> +   they will be discarded.)

Presumably this is a list of SHA-1 names, as we do not have enough
information to compute NewHash names yet at this point.  May want to
spell it out here.

Would it discard the auto-followed tags if we do the "traverse from
wants only"?  Traversing the objects in the packfile to find the
"tips" that are not referenced from any other object in the pack
might be necessary, and it shouldn't be too costly, I'd guess.

> +3. convert to newhash: open a new (newhash) packfile. Read the topologically
> +   sorted list just generated. For each object, inflate its
> +   sha1-content, convert to newhash-content, and write it to the newhash
> +   pack. Record the new sha1<->newhash mapping entry for use in the idx.

Are we doing any deltification here?

If we are computing a .pack/.idx pair that can be usable in the SHA-1
world in step #1, then reusing blob deltas should be trivial (a good
delta-base in the SHA-1 world is a good delta-base in the NewHash
world, too).  For things that have outgoing references, like trees,
such a heuristic may not give us the absolute best delta-base, but I
guess it would still be a good approximation to reuse the delta/base
object relationship in the SHA-1 world for the NewHash world,
assuming that the server did a good job choosing the bases.

> +4. sort: reorder entries in the new pack to match the order of objects
> +   in the pack the server generated and include blobs. Write a newhash idx
> +   file

OK.

> +5. clean up: remove the SHA-1 based pack file, index, and
> +   topologically sorted list obtained from the server in steps 1
> +   and 2.

Ah, OK, so we do write the SHA-1 pack/idx in the first step.  OK.

> +Push
> +
> +Push is simpler than fetch because the objects referenced by the
> +pushed objects are already in the translation table. The sha1-content
> +of each object being pushed can be read as described in the "Reading
> +an object's sha1-content" section to generate the pack written by git
> +send-pack.

OK.

> +Signed Commits
> +~~
> +We add a new field "gpgsig-newhash" to the commit object format to allow
>
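
The ordering required by step 2 of the fetch conversion (every object listed after everything it references, blobs left out, only objects reachable from the "want"s) can be sketched as an iterative post-order walk. The in-memory `pack` dictionary and the function name are assumptions made for illustration; a real implementation would read this from the SHA-1 pack and index written in step 1.

```python
def conversion_order(wants, pack):
    """List non-blob objects reachable from `wants`, dependencies first.

    `pack` maps an object name to (type, [names it references]).
    Names missing from `pack` are taken to be objects we already have
    locally and are skipped.
    """
    order, done = [], set()
    for want in wants:
        stack = [(want, False)]
        while stack:
            name, expanded = stack.pop()
            if name in done or name not in pack:
                continue
            kind, refs = pack[name]
            if kind == "blob":
                continue  # blobs need no conversion; they are added back in step 4
            if expanded:
                done.add(name)
                order.append(name)  # every reference was emitted before this
            else:
                stack.append((name, True))
                stack.extend((ref, False) for ref in refs)
    return order


pack = {
    "c2": ("commit", ["t2", "c1"]),
    "c1": ("commit", ["t1"]),
    "t2": ("tree", ["b1", "t1"]),
    "t1": ("tree", ["b1"]),
    "b1": ("blob", []),
    "x": ("commit", []),  # extraneous: not reachable from the want
}
print(conversion_order(["c2"], pack))
```

As raised in the reply above, a walk that starts only from the "want"s would drop anything the server sent that is not reachable from them, which is where auto-followed tags would need separate handling.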
Re: [PATCH v4] technical doc: add a design doc for hash function transition
Jonathan Niederwrites: >>> +6. Skip fetching some submodules of a project into a NewHash >>> + repository. (This also depends on NewHash support in Git >>> + protocol.) >> >> It is unclear what this means. Around submodule support, one thing >> I can think of is that a NewHash tree in a superproject would record >> a gitlink that is a NewHash commit object name in it, therefore it >> cannot refer to an unconverted SHA-1 submodule repository. But it >> is unclear if the above description refers to the same issue, or >> something else. > > It refers to that issue. We may want to find a way to make it clear, then. >> It makes me wonder if we want to add the hashname in this object >> header. "length" would be different for non-blob objects anyway, >> and it is not "compat metadata" we want to avoid baked in, yet it >> would help diagnose a mistake of attempting to use a "mixed" objects >> in a single repository. Not a big issue, though. > > Do you mean that adding the hashname into the computation that > produces the object name would help in some use case? What I mean is that for SHA-1 objects we keep the object header to be " NUL". For objects in newer world, use the object header to " NUL", and include the hashname in the object name computation. > For loose objects, it would be nice to name the hash in the file, so > that "file" can understand what is happening if someone accidentally > mixes types using "cp". The only downside is losing the ability to > copy blobs (which have the same content despite being named using > different hashes) between repositories after determining their new > names. That doesn't seem like a strong downside --- it's pretty > harmless to include the hash type in loose object files, too. I think > I would prefer this to be a "magic number" instead of part of the > zlib-deflated payload, since this way "file" can discover it more > easily. 
Yeah, thanks for doing pros-and-cons for me ;-) >> If it is a goal to eventually be able to lose SHA-1 compatibility >> metadata from the objects, then we might want to remove SHA-1 based >> signature bits (e.g. PGP trailer in signed tag, gpgsig header in the >> commit object) from NewHash contents, and instead have them stored >> in a side "metadata" table, only to be used while converting back. >> I dunno if that is desirable. > > I don't consider that desirable. Agreed. Let's not go there. >> Hmm, as the corresponding packfile stores object data only in >> NewHash content format, it is somewhat curious that this table that >> stores CRC32 of the data appears in the "Tables for each object >> format" section, as they would be identical, no? Unless I am >> grossly misleading the spec, the checksum should either go outside >> the "Tables for each object format" section but still in .idx, or >> should be eliminated and become part of the packdata stream instead, >> perhaps? > > It's actually only present for the first object format. Will find a > better way to describe this. I see. One way to do so is to have it upfront before the "after this point, these tables repeat for each of the hashes" part of the file. >> Oy. So we can go from a short prefix to the pack location by first >> finding it via binsearch in the short-name table, realize that it is >> nth object in the object name order, and consulting this table. >> When we know the pack-order of an object, there is no direct way to >> go to its location (short of reversing the name-order-to-pack-order >> table)? > > An earlier version of the design also had a pack-order-to-pack-offset > table, but we weren't able to think of any cases where that would be > used without also looking up the object name that can be used to > verify the integrity of the inflated object. 
The primary thing I was interested in knowing was if we tried to think of any case where it may be useful and then didn't think of any---I couldn't, but I know I am not imaginative enough, and I wanted to know that you guys didn't, either.
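
A small illustration of the idea discussed earlier in this message, that the hash name could take part in the object name computation. Today's SHA-1 names hash a "<type> <length> NUL" header followed by the content. The labeled header below is a made-up layout used only to show the effect; the field order, the `sha256` label and the choice of SHA-256 as a stand-in for NewHash are all assumptions, not what was proposed.

```python
import hashlib


def sha1_name(kind, content):
    """Current naming: SHA-1 over "<type> <length>\\0" plus the content."""
    return hashlib.sha1(kind + b" %d\0" % len(content) + content).hexdigest()


def labeled_name(kind, content, label=b"sha256"):
    """Hypothetical naming: the hash label is part of the hashed header."""
    header = label + b" " + kind + b" %d\0" % len(content)
    return hashlib.sha256(header + content).hexdigest()


blob = b"hello\n"
print(sha1_name(b"blob", blob))
print(labeled_name(b"blob", blob))
```

With a label in the header, even a blob, whose content is the same under both hashes, no longer shares its hashed byte stream between the two worlds, which is the property that would help catch accidentally mixed objects.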
Re: [PATCH v4] technical doc: add a design doc for hash function transition
Junio C Hamano wrote: > Jonathan Niederwrites: >> This document describes what a transition to a new hash function for >> Git would look like. Add it to Documentation/technical/ as the plan >> of record so that future changes can be recorded as patches. >> >> Also-by: Brandon Williams >> Also-by: Jonathan Tan >> Also-by: Stefan Beller >> Signed-off-by: Jonathan Nieder >> --- > > Shoudln't these all be s-o-b: (with a note immediately before that > to say all four contributed equally or something)? I don't want to get lost in the weeds in the question of how to represent such a collaborative effort in git's metadata. You're right that I should collect their sign-offs! Your approach of using text instead of machine-readable data for common authorship also seems okay. In any event, this is indeed Signed-off-by: Brandon Williams Signed-off-by: Jonathan Tan Signed-off-by: Stefan Beller (I just checked :)). >> +Background >> +-- >> +At its core, the Git version control system is a content addressable >> +filesystem. It uses the SHA-1 hash function to name content. For >> +example, files, directories, and revisions are referred to by hash >> +values unlike in other traditional version control systems where files >> +or versions are referred to via sequential numbers. The use of a hash > > Traditional systems refer to files via numbers??? Perhaps "where > versions of files are referred to via sequential numbers" or > something? Good point. The wording you suggested will work well. >> +function to address its content delivers a few advantages: >> + >> +* Integrity checking is easy. Bit flips, for example, are easily >> + detected, as the hash of corrupted content does not match its name. >> +* Lookup of objects is fast. > > * There is no ambiguity what the object's name should be, given its > content. > > * Deduping the same content copied across versions and paths is > automatic. :) Yep, these are nice too, especially that second one. It also is how we make diff-ing fast. 
>> +SHA-1 still possesses the other properties such as fast object lookup >> +and safe error checking, but other hash functions are equally suitable >> +that are believed to be cryptographically secure. > > s/secure/more &/, perhaps? We were looking for a phrase meaning that it should be a cryptographic hash function in good standing, which SHA-1 is at least approaching not being. "more secure" should work fine. Let's go with that. >> +Goals >> +- >> +... >> + c. Users can use SHA-1 and NewHash identifiers for objects >> + interchangeably (see "Object names on the command line", below). > > Mental note. This needs to extend to the "index X..Y" lines in the > patch output, which is used by "apply -3" and "am -3". Will add a note about this to "Object names on the command line". Stefan had already pointed out that that section should really be renamed to something like "Object names in input and output". >> +2. Allow a complete transition away from SHA-1. >> + a. Local metadata for SHA-1 compatibility can be removed from a >> + repository if compatibility with SHA-1 is no longer needed. > > I like the emphasis on "Local" here. Metadata for compatiblity that > is embedded in the objects obviously cannot be removed. > > From that point of view, one of the goals ought to be "make sure > that as much SHA-1 compatibility metadata as possible is local and > outside the object". This goal may not be able to say more than "as > much as possible", as signed objects that came from SHA-1 world > needs to carry the compatibility metadata somewhere somehow. > > Or perhaps we could. There is nothing that says a signed tag > created in the SHA-1 world must have the PGP/SHA-1 signature in the > NewHash payload---it could be split off of the object data and > stored in a local metadata cache, to be used only when we need to > convert it back to the SHA-1 world. That would break round-tripping and would mean that multiple SHA-1 objects could have the same NewHash name. 
In other words, from my point of view there is something that says that such data must be preserved. Another way to put it: even after removing all SHA-1 compatibility metadata, one nice feature of this design is that it can be recovered if I change my mind, from data in the NewHash based repository alone. [...] >> +Non-Goals >> +- >> ... >> +6. Skip fetching some submodules of a project into a NewHash >> + repository. (This also depends on NewHash support in Git >> + protocol.) > > It is unclear what this means. Around submodule support, one thing > I can think of is that a NewHash tree in a superproject would record > a gitlink that is a NewHash commit object name in it, therefore it > cannot refer to an unconverted SHA-1 submodule repository. But
Re: [PATCH v4] technical doc: add a design doc for hash function transition
Junio C Hamano writes:

> Or perhaps we could.  There is nothing that says a signed tag
> created in the SHA-1 world must have the PGP/SHA-1 signature in the
> NewHash payload---it could be split off of the object data and
> stored in a local metadata cache, to be used only when we need to
> convert it back to the SHA-1 world.
> ...
>> +The format allows round-trip conversion between newhash-content and
>> +sha1-content.
>
> If it is a goal to eventually be able to lose SHA-1 compatibility
> metadata from the objects, then we might want to remove SHA-1 based
> signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
> commit object) from NewHash contents, and instead have them stored
> in a side "metadata" table, only to be used while converting back.
> I dunno if that is desirable.

Let's keep it simple by ignoring all of the above.  Even though
leaving the sha1-gpgsig and other cruft would etch this compatibility
metadata in objects forever, it remains only in objects that
originate from the SHA-1 world, or in objects created in the NewHash
world only while the project participants still care about SHA-1
compatibility.  Strictly speaking, it would be super nice if we could
do without contaminating these newly created objects with SHA-1
compatibility headers, just like we wish to be able to drop the SHA-1
vs NewHash mapping table after project participants stop caring about
SHA-1 compatibility, but it may not be worth it.  Of course, if we
decide to spend a bit more brain cycles to design how we push these
out of the object proper, the same solution would automatically allow
us to omit SHA-1 compatibility headers from the objects that were
converted from the SHA-1 world.

>
>> + - A table of 4-byte CRC32 values of the packed object data, in the
>> +order that the objects appear in the pack file. This is to allow
>> +compressed data to be copied directly from pack to pack during
>> +repacking without undetected data corruption.
>
> An obvious alternative would be to have the CRC32 checksum near
> (e.g. immediately before) the object data in the packfile (as
> opposed to the .idx file like this document specifies).  I am not
> sure what the pros and cons are between the two, though, and that is
> why I mention the possibility here.
>
> Hmm, as the corresponding packfile stores object data only in
> NewHash content format, it is somewhat curious that this table that
> stores CRC32 of the data appears in the "Tables for each object
> format" section, as they would be identical, no?  Unless I am
> grossly misleading the spec, the checksum should either go outside
> the "Tables for each object format" section but still in .idx, or
> should be eliminated and become part of the packdata stream instead,
> perhaps?

Thinking about this a bit more, I think a single table per .idx file
would be the right way to go, not a checksum immediately after or
before the object data that is embedded in the pack stream.  In the
NewHash world (after this initial migration), we would want to be
able to stream a NewHash packstream that comes from the network
straight to disk, which would mean these in-line CRC32 data would
need to be sent over the wire (i.e. 4 bytes per object sent); that is
an unneeded overhead, as the packstream has its trailing checksum to
protect the whole thing anyway.
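
As a rough illustration of what the per-object CRC32 table buys, the sketch below copies an already-compressed entry from one pack to another without inflating it, checking the stored CRC32 first. The byte layout is simplified (a real pack entry also carries a type and size header before the deflated data), and `copy_packed_entry` and its parameters are hypothetical names.

```python
import zlib


def copy_packed_entry(src, offset, length, expected_crc):
    """Return the raw packed bytes at src[offset:offset+length].

    The entry is checked against its CRC32 from the index, so corrupt
    data is caught without inflating and re-hashing the object.
    """
    raw = src[offset:offset + length]
    if zlib.crc32(raw) & 0xFFFFFFFF != expected_crc:
        raise ValueError("CRC32 mismatch at offset %d" % offset)
    return raw


# Two fake "packed objects" laid out back to back, plus their CRC table.
first, second = zlib.compress(b"first object"), zlib.compress(b"second object")
pack = first + second
crc_table = [zlib.crc32(first) & 0xFFFFFFFF, zlib.crc32(second) & 0xFFFFFFFF]

copied = copy_packed_entry(pack, len(first), len(second), crc_table[1])
print(zlib.decompress(copied))
```

Keeping the table in the .idx, as argued above, means these four bytes per object are computed locally and never have to travel over the wire.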
Re: [PATCH v4] technical doc: add a design doc for hash function transition
Jonathan Niederwrites: > This document describes what a transition to a new hash function for > Git would look like. Add it to Documentation/technical/ as the plan > of record so that future changes can be recorded as patches. > > Also-by: Brandon Williams > Also-by: Jonathan Tan > Also-by: Stefan Beller > Signed-off-by: Jonathan Nieder > --- Shoudln't these all be s-o-b: (with a note immediately before that to say all four contributed equally or something)? > +Background > +-- > +At its core, the Git version control system is a content addressable > +filesystem. It uses the SHA-1 hash function to name content. For > +example, files, directories, and revisions are referred to by hash > +values unlike in other traditional version control systems where files > +or versions are referred to via sequential numbers. The use of a hash Traditional systems refer to files via numbers??? Perhaps "where versions of files are referred to via sequential numbers" or something? > +function to address its content delivers a few advantages: > + > +* Integrity checking is easy. Bit flips, for example, are easily > + detected, as the hash of corrupted content does not match its name. > +* Lookup of objects is fast. * There is no ambiguity what the object's name should be, given its content. * Deduping the same content copied across versions and paths is automatic. > +SHA-1 still possesses the other properties such as fast object lookup > +and safe error checking, but other hash functions are equally suitable > +that are believed to be cryptographically secure. s/secure/more &/, perhaps? > +Goals > +- > +... > + c. Users can use SHA-1 and NewHash identifiers for objects > + interchangeably (see "Object names on the command line", below). Mental note. This needs to extend to the "index X..Y" lines in the patch output, which is used by "apply -3" and "am -3". > +2. Allow a complete transition away from SHA-1. > + a. 
Local metadata for SHA-1 compatibility can be removed from a > + repository if compatibility with SHA-1 is no longer needed. I like the emphasis on "Local" here. Metadata for compatibility that is embedded in the objects obviously cannot be removed. From that point of view, one of the goals ought to be "make sure that as much SHA-1 compatibility metadata as possible is local and outside the object". This goal may not be able to say more than "as much as possible", as signed objects that came from the SHA-1 world need to carry the compatibility metadata somewhere somehow. Or perhaps we could. There is nothing that says a signed tag created in the SHA-1 world must have the PGP/SHA-1 signature in the NewHash payload---it could be split off of the object data and stored in a local metadata cache, to be used only when we need to convert it back to the SHA-1 world. But I am getting ahead of myself before reading the proposal through. > +Non-Goals > +- > ... > +6. Skip fetching some submodules of a project into a NewHash > + repository. (This also depends on NewHash support in Git > + protocol.) It is unclear what this means. Around submodule support, one thing I can think of is that a NewHash tree in a superproject would record a gitlink that is a NewHash commit object name in it, therefore it cannot refer to an unconverted SHA-1 submodule repository. But it is unclear if the above description refers to the same issue, or something else. > +Overview > + > +We introduce a new repository format extension. Repositories with this > +extension enabled use NewHash instead of SHA-1 to name their objects. > +This affects both object names and object content --- both the names > +of objects and all references to other objects within an object are > +switched to the new hash function. > + > +NewHash repositories cannot be read by older versions of Git. > + > +Alongside the packfile, a NewHash repository stores a bidirectional > +mapping between NewHash and SHA-1 object names. 
The mapping is generated > +locally and can be verified using "git fsck". Object lookups use this > +mapping to allow naming objects using either their SHA-1 and NewHash names > +interchangeably. > + > +"git cat-file" and "git hash-object" gain options to display an object > +in its sha1 form and write an object given its sha1 form. Both of these are somewhat unclear. I am guessing that "git cat-file --convert-to=sha1 " would emit the object contents converted from their NewHash payload to SHA-1 payload (blobs are unchanged; trees, commits and tags get their outgoing references converted from NewHash to their SHA-1 counterparts), and that is what you mean by "options to display an object in its sha1 form". I am not sure how "git hash-object" with the option would work, though. Do you give an option "--hash=sha1 --stdout --stdin -t " to feed NewHash contents (file,