Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-10-03 Thread Junio C Hamano
Jonathan Nieder  writes:

> +Alternatives considered
> +---
> +Upgrading everyone working on a particular project on a flag day
> +
> ...
> +Using hash functions in parallel
> +
> ...

Good that we are not doing these ;-)

> +Lazily populated translation table
> +~~
> +Some of the work of building the translation table could be deferred to
> +push time, but that would significantly complicate and slow down pushes.
> +Calculating the sha1-name at object creation time at the same time it is
> +being streamed to disk and having its newhash-name calculated should be
> +an acceptable cost.

And the version described in the body of the document hopefully
would be simpler.  It certainly would be, when SHA-1 content and
NewHash content are the same (i.e. blob).

Thanks.


Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-10-03 Thread Jason Cooper
On Tue, Oct 03, 2017 at 02:40:26PM +0900, Junio C Hamano wrote:
> Jonathan Nieder  writes:
...
> > +Meaning of signatures
> > +~
> > +The signed payload for signed commits and tags does not explicitly
> > +name the hash used to identify objects. If some day Git adopts a new
> > +hash function with the same length as the current SHA-1 (40
> > +hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
> > +intent behind the PGP signed payload in an object signature is
> > +unclear:
> > +
> > +   object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
> > +   type commit
> > +   tag v2.12.0
> > +   tagger Junio C Hamano  1487962205 -0800
> > +
> > +   Git 2.12
> > +
> > +Does this mean Git v2.12.0 is the commit with sha1-name
> > +e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
> > +new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
> > +
> > +Fortunately NewHash and SHA-1 have different lengths. If Git starts
> > +using another hash with the same length to name objects, then it will
> > +need to change the format of signed payloads using that hash to
> > +address this issue.
> 
> This is not just signatures, is it?  The reference to parent commits
> and its tree in a commit object would also have ambiguity between
> SHA-1 and new-40-digit-hash.  And the "no mixed repository" rule
> resolved that for us---isn't that sufficient for the signed tag (or
> commit), too?  If such a signed-tag appears in a SHA-1 content of a
> tag, then the "object" reference is made with SHA-1.  If the tag is
> in NewHash40 content, "object" reference is made with NewHash40, no?

I do hope we adhere to "no mixed repository" rule.  Or, at least, "no
mixing of hash types".  Ambiguity opens cracks for uncertainty to creep
in.

For our case, where we counter-hash the sha1 commits, and counter-sign
the sha1-based signatures, we intend to include the relevant
sha1<->newhash lookups in the newhash signature body.  afaict, the git
sha1<->newhash table is not cryptographically secured underneath
signatures, and thus can't be used in the verification of objects.

The advantage to this approach is that we can be as explicit as
necessary with "SHA-1 -> SHA-512/256" or "SHA-1 -> SHA3-256" in the body
of the message.

thx,

Jason.


Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-10-02 Thread Junio C Hamano
Jonathan Nieder  writes:

> +Signed Tags
> +~~~
> +We add a new field "gpgsig-newhash" to the tag object format to allow
> +signing tags without relying on SHA-1. Its signed payload is the
> +newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
> +SIGNATURE-----" delimited in-body signature removed.
> +
> +This means tags can be signed
> +1. using SHA-1 only, as in existing signed tag objects
> +2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body
> +   signature.
> +3. using only NewHash, by only using the gpgsig-newhash field.

I have the same issue with signed commit.

The signed parts for SHA-1 contents exclude the in-body signature
(obviously) and all the headers including gpgsig-newhash that is not
known to our old clients are included.  The signed parts for NewHash
contents exclude the in-body signature and gpgsig-newhash header,
but all other headers.  I somehow feel that we should just reserve
gpgsig-* to prepare for the day when we introduce newhash2 and later
and exclude all of them from the computation.  Treat the difference
between how SHA-1 contents exclude _only_ the signatures they know
about and how NewHash contents exclude _all_ possible signatures the
same way as the difference in where SHA-1 and NewHash contents keep
the signature.  That is, yes, we didn't know better when we designed
SHA-1 contents, but now we know better and are correcting the
mistakes by moving the signature from in-body tail to a header, and
by excluding anything gpgsig-*, not just the known ones.
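
For illustration, a sketch of the exclusion rule being proposed here,
i.e. strip any gpgsig-* header (known or future) and any in-body
signature before signing or verifying; header continuation lines are
glossed over for brevity:

    def signed_payload(newhash_content):
        # Drop every gpgsig-* header, not just the ones this version of
        # Git knows about, plus any in-body PGP signature block; what is
        # left is the payload that gets signed.
        headers, _, body = newhash_content.partition("\n\n")
        kept = [h for h in headers.split("\n")
                if not h.startswith("gpgsig")]
        body = body.split("-----BEGIN PGP SIGNATURE-----")[0]
        return "\n".join(kept) + "\n\n" + body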

> +Mergetag embedding
> +~~
> +The mergetag field in the sha1-content of a commit contains the
> +sha1-content of a tag that was merged by that commit.
> +
> +The mergetag field in the newhash-content of the same commit contains the
> +newhash-content of the same tag.

OK.  

We do not have a tool that extracts them and creates a tag object,
but if such a tool is invented in the future, it would only have to
worry about newhash content, as it would be a local operation.
Makes sense.

> +Submodules
> +~~
> +To convert recorded submodule pointers, you need to have the converted
> +submodule repository in place. The translation table of the submodule
> +can be used to look up the new hash.

OK, I earlier commented on a paragraph that I couldn't tell what it
was talking about, but this is a lot more understandable.  Perhaps
the earlier one can be removed?

We saw earlier what happens during "fetch".  This seems to hint that
we would need to do a "recursive" fetch in the bottom-up direction,
but without fetching the superproject, you wouldn't know what submodules
are needed and from where, so there is a bit of a chicken-and-egg problem
we need to address, as we further make the design more detailed.
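
For illustration, the lookup itself is simple once the submodule has
been converted; the table interface here is made up for the example:

    def convert_gitlink(sha1_gitlink, submodule_table):
        # The superproject's tree records the submodule commit by name,
        # so converting that tree needs the submodule's own translation
        # table to supply the corresponding newhash name.
        try:
            return submodule_table[sha1_gitlink]
        except KeyError:
            raise KeyError("submodule commit not converted yet: %s"
                           % sha1_gitlink)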

> +Loose objects and unreachable objects
> +~
> ...
> +"git gc --auto" currently waits for there to be 50 packs present
> +before combining packfiles. Packing loose objects more aggressively
> +may cause the number of pack files to grow too quickly. This can be
> +mitigated by using a strategy similar to Martin Fick's exponential
> +rolling garbage collection script:
> +https://gerrit-review.googlesource.com/c/gerrit/+/35215

Yes, concatenating into the latest pack that still is small may be a
reasonable way, as there won't be many good chances to create good
deltas anyway until you have blobs and trees in sufficient numbers
of different versions, to do a "quick GC whose only purpose is to
keep the number of loose object down".

> +To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
> +combined under certain circumstances. If "gc.garbageTtl" is set to
> +greater than one day, then packs created within a single calendar day,
> +UTC, can be coalesced together. The resulting packfile would have an
> +mtime before midnight on that day, so this makes the effective maximum
> +ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day,
> +then we divide the calendar day into intervals one-third of that ttl
> +in duration. Packs created within the same interval can be coalesced
> +together. The resulting packfile would have an mtime before the end of
> +the interval, so this makes the effective maximum ttl equal to the
> +garbageTtl * 4/3.

OK.  

Is the use of mtime essential, or because packs are "write once and
from there access read-only", would a timestamp written somewhere in
the header or the trailer of the file, if it existed, work equally
well?  Not a strong objection, but a mild suggestion that not
relying on mtime may be a good idea (it will keep an accidental /
unintended "touch" from keeping garbage alive longer than you want).

> +The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack
> +index. More generally, that field indicates where a pack came from:
> +
> + - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network
> + - 2 (PACK_SOURCE_AUTO) for a pack created by a 

Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-10-02 Thread Jason Cooper
On Fri, Sep 29, 2017 at 10:34:13AM -0700, Jonathan Nieder wrote:
> Junio C Hamano wrote:
> > Jonathan Nieder  writes:
...
> > If it is a goal to eventually be able to lose SHA-1 compatibility
> > metadata from the objects, then we might want to remove SHA-1 based
> > signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
> > commit object) from NewHash contents, and instead have them stored
> > in a side "metadata" table, only to be used while converting back.
> > I dunno if that is desirable.
> 
> I don't consider that desirable.
> 
> A SHA-1 based signature is still of historical interest even if my
> centuries-newer version of Git is not able to verify it.

Agreed, even a signature made by a now exposed and revoked key still has
validity.  Especially in a commit or merge.  We know it was made prior
to the key being compromised / revoked.

This is assuming that the keyholder can definitively say "Don't trust
signatures from this key after this date/time".  And the signature
in question is in the git history prior to that cutoff.

Tags are a different animal because they can be added at any time and
aren't directly incorporated into the history.

thx,

Jason.


Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-10-02 Thread Jason Cooper
Hi Jonathan,

On Wed, Sep 27, 2017 at 09:43:21PM -0700, Jonathan Nieder wrote:
> This document describes what a transition to a new hash function for
> Git would look like.  Add it to Documentation/technical/ as the plan
> of record so that future changes can be recorded as patches.
> 
> Also-by: Brandon Williams 
> Also-by: Jonathan Tan 
> Also-by: Stefan Beller 
> Signed-off-by: Jonathan Nieder 
> ---
> On Thu, Mar 09, 2017 at 11:14 AM, Shawn Pearce wrote:
> > On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder  wrote:
> 
> >> Thanks for the kind words on what had quite a few flaws still.  Here's
> >> a new draft.  I think the next version will be a patch against
> >> Documentation/technical/.
> >
> > FWIW, I like this approach.
> 
> Okay, here goes.
> 
> Instead of sharding the loose object translation tables by first byte,
> we went for a single table.  It simplifies the design and we need to
> keep the number of loose objects under control anyway.
> 
> We also included a description of the transition plan and tried to
> include a summary of what has been agreed upon so far about the choice
> of hash function.
> 
> Thanks to Junio for reviving the discussion and in particular to Dscho
> for pushing this forward and making the missing pieces clearer.
> 
> Thoughts of all kinds welcome, as always.
> 
>  Documentation/Makefile |   1 +
>  .../technical/hash-function-transition.txt | 797 +
>  2 files changed, 798 insertions(+)
>  create mode 100644 Documentation/technical/hash-function-transition.txt
> 
...
> diff --git a/Documentation/technical/hash-function-transition.txt 
> b/Documentation/technical/hash-function-transition.txt
> new file mode 100644
> index 00..417ba491d0
> --- /dev/null
> +++ b/Documentation/technical/hash-function-transition.txt
> @@ -0,0 +1,797 @@
> +Git hash function transition
> +
> +
> +Objective
> +-
> +Migrate Git from SHA-1 to a stronger hash function.
> +
...
> +Goals
> +-
> +Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
> +"Selection of a New Hash", below):

Could we clarify and say "a strong hash function with 256-bit output"?

...
> +Overview
> +
> +We introduce a new repository format extension. Repositories with this
> +extension enabled use NewHash instead of SHA-1 to name their objects.
> +This affects both object names and object content --- both the names
> +of objects and all references to other objects within an object are
> +switched to the new hash function.
> +
> +NewHash repositories cannot be read by older versions of Git.
> +
> +Alongside the packfile, a NewHash repository stores a bidirectional
> +mapping between NewHash and SHA-1 object names. The mapping is generated
> +locally and can be verified using "git fsck". Object lookups use this
> +mapping to allow naming objects using either their SHA-1 and NewHash names
> +interchangeably.

nit: Are we presuming that abbreviated hashes won't collide?  Or the
user needs to specify which hash type?
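
One plausible answer, purely as a sketch: treat a prefix that matches in
both name spaces like any other ambiguous abbreviation, unless the user
says which hash they mean.  The index interface is invented here:

    def resolve_abbrev(prefix, sha1_index, newhash_index):
        # Collect matches from both object-name tables; more than one
        # hit, within or across the tables, is the usual ambiguity error.
        hits = (sha1_index.prefix_matches(prefix)
                + newhash_index.prefix_matches(prefix))
        if len(hits) != 1:
            raise ValueError("ambiguous or unknown object name: " + prefix)
        return hits[0]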

> +Object format
> +~
> +The content as a byte sequence of a tag, commit, or tree object named
> +by sha1 and newhash differ because an object named by newhash-name refers to
> +other objects by their newhash-names and an object named by sha1-name
> +refers to other objects by their sha1-names.
> +
> +The newhash-content of an object is the same as its sha1-content, except
> +that objects referenced by the object are named using their newhash-names
> +instead of sha1-names. Because a blob object does not refer to any
> +other object, its sha1-content and newhash-content are the same.
> +
> +The format allows round-trip conversion between newhash-content and
> +sha1-content.

It would be nice here to explicitly mention deterministic hashing.
Meaning that anyone who converts a commit from sha1 to newhash shall get
the same newhash.
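
A sketch of the rewrite for a commit's sha1-content, to make that
determinism concrete: only the referenced names change, so anyone with
the same translation table computes the same newhash-content.  A real
implementation would rewrite only the structured header fields, not
arbitrary hex strings, and would handle binary tree entries; this is
simplified:

    import re

    HEX40 = re.compile(r"\b[0-9a-f]{40}\b")

    def newhash_content_of_commit(sha1_content, sha1_to_newhash):
        # Replace each referenced sha1-name (tree, parents) with its
        # newhash-name; everything else stays byte-for-byte identical.
        return HEX40.sub(lambda m: sha1_to_newhash[m.group(0)], sha1_content)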

> +
> +Object storage
> +~~
> +Loose objects use zlib compression and packed objects use the packed
> +format described in Documentation/technical/pack-format.txt, just like
> +today. The content that is compressed and stored uses newhash-content
> +instead of sha1-content.
> +
> +Pack index
> +~~
> +Pack index (.idx) files use a new v3 format that supports multiple
> +hash functions. They have the following format (all integers are in
> +network byte order):
> +
> +- A header appears at the beginning and consists of the following:
> +  - The 4-byte pack index signature: '\377t0c'
> +  - 4-byte version number: 3
> +  - 4-byte length of the header section, including the signature and
> +version number
> +  - 4-byte number of objects contained in the pack
> +  - 4-byte number of object formats in this pack index: 2
> +  - For each object format:
> +- 4-byte format identifier (e.g., 'sha1' for SHA-1)

This seems a 
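
For concreteness, the header fields quoted above could be serialized as
below; the 's256' identifier for NewHash is only a placeholder, and only
the fields shown in the quote are covered:

    import struct

    def idx_v3_header(num_objects, format_ids=(b"sha1", b"s256")):
        # Signature, version, header length, object count, format count,
        # then one 4-byte identifier per object format, all big-endian.
        length = 5 * 4 + 4 * len(format_ids)
        header = struct.pack(">4s4I", b"\377t0c", 3, length,
                             num_objects, len(format_ids))
        return header + b"".join(format_ids)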

Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-10-02 Thread Junio C Hamano
Jonathan Nieder  writes:

> +Reading an object's sha1-content
> +
> +The sha1-content of an object can be read by converting all newhash-names
> +its newhash-content references to sha1-names using the translation table.

Sure.

> +Fetch
> +~
> +Fetching from a SHA-1 based server requires translating between SHA-1
> +and NewHash based representations on the fly.
> +
> +SHA-1s named in the ref advertisement that are present on the client
> +can be translated to NewHash and looked up as local objects using the
> +translation table.
> +
> +Negotiation proceeds as today. Any "have"s generated locally are
> +converted to SHA-1 before being sent to the server, and SHA-1s
> +mentioned by the server are converted to NewHash when looking them up
> +locally.

Any of our alternate object stores is by definition a NewHash
repository--otherwise we'd violate the "no mixing" rule.  It may or may
not have the translation table for its objects.  If it no longer
has the translation table (because it migrated to the NewHash-only world
before we did), then we can still use it as our alternate but we
cannot use it for the purpose of common ancestor discovery.
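
A sketch of that conversion during negotiation; the table interface is
invented, and tips without a translation entry simply cannot serve as
common-ancestor candidates against a SHA-1 server:

    def haves_for_sha1_server(local_tips_newhash, newhash_to_sha1):
        # Everything we send as a "have" must be spelled in SHA-1; any
        # tip we cannot translate is left out of the negotiation.
        return [newhash_to_sha1[tip] for tip in local_tips_newhash
                if tip in newhash_to_sha1]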

> +After negotiation, the server sends a packfile containing the
> +requested objects.

s/objects.$/& These are all SHA-1 contents./

> +We convert the packfile to NewHash format using
> +the following steps:
> +
> +1. index-pack: inflate each object in the packfile and compute its
> +   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
> +   objects the client has locally. These objects can be looked up
> +   using the translation table and their sha1-content read as
> +   described above to resolve the deltas.

That procedure would give us the object's SHA-1 contents for
ref-delta objects.  For an ofs-delta object, by definition, its base
object should appear in the same packstream, so we should eventually
be able to get to the SHA-1 contents of the delta base, and from
there we can apply the delta to obtain the SHA-1 contents.  For a
non-delta object, we already have its SHA-1 contents in the
packstream.

So we can get SHA-1 names and SHA-1 contents of each and every
object in the packstream in this step.
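
A sketch of the REF_DELTA case in step 1; the repository helpers and
apply_delta are hypothetical stand-ins for whatever index-pack would
actually call:

    def sha1_content_of_ref_delta(base_sha1, delta, repo):
        # The delta base is named by sha1: translate it, read its
        # newhash-content locally, convert that back to sha1-content as
        # in "Reading an object's sha1-content", then apply the delta.
        base_newhash = repo.translation_table[base_sha1]
        base_content = repo.read_sha1_content(base_newhash)  # hypothetical
        return apply_delta(base_content, delta)               # hypothetical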

Are we actually writing out a .pack/.idx pair that is usable in the
SHA-1 world at this stage?  Or are we going to read from something
we keep in-core in the step #3 below?

> +2. topological sort: starting at the "want"s from the negotiation
> +   phase, walk through objects in the pack and emit a list of them,
> +   excluding blobs, in reverse topologically sorted order, with each
> +   object coming later in the list than all objects it references.
> +   (This list only contains objects reachable from the "wants". If the
> +   pack from the server contained additional extraneous objects, then
> +   they will be discarded.)

Presumably this is a list of SHA-1 names, as we do not yet have
enough information to compute NewHash names yet at this point.  May
want to spell it out here.

Would it discard the auto-followed tags if we do the "traverse from
wants only"?  Traversing the objects in the packfile to find the
"tips" that are not referenced from any other object in the pack
might be necessary, and it shouldn't be too costly, I'd guess.

> +3. convert to newhash: open a new (newhash) packfile. Read the topologically
> +   sorted list just generated. For each object, inflate its
> +   sha1-content, convert to newhash-content, and write it to the newhash
> +   pack. Record the new sha1<->newhash mapping entry for use in the idx.

Are we doing any deltification here?  If we are computing .pack/.idx
pair that can be usable in the SHA-1 world in step #1, then reusing
blob deltas should be trivial (a good delta-base in the SHA-1 world
is a good delta-base in the NewHash world, too).  For things that have
outgoing references, like trees, it is possible that such a
heuristic may not give us the absolute best delta-base, but I guess
it would still be a good approximation to reuse the delta/base
object relationship in SHA-1 world to NewHash world, assuming that
the server did a good job choosing the bases.

> +4. sort: reorder entries in the new pack to match the order of objects
> +   in the pack the server generated and include blobs. Write a newhash idx
> +   file

OK.

> +5. clean up: remove the SHA-1 based pack file, index, and
> +   topologically sorted list obtained from the server in steps 1
> +   and 2.

Ah, OK, so we do write the SHA-1 pack/idx in the first step.  OK.

> +Push
> +
> +Push is simpler than fetch because the objects referenced by the
> +pushed objects are already in the translation table. The sha1-content
> +of each object being pushed can be read as described in the "Reading
> +an object's sha1-content" section to generate the pack written by git
> +send-pack.

OK.

> +Signed Commits
> +~~
> +We add a new field "gpgsig-newhash" to the commit object format to allow
> 

Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-10-02 Thread Junio C Hamano
Jonathan Nieder  writes:

>>> +6. Skip fetching some submodules of a project into a NewHash
>>> +   repository. (This also depends on NewHash support in Git
>>> +   protocol.)
>>
>> It is unclear what this means.  Around submodule support, one thing
>> I can think of is that a NewHash tree in a superproject would record
>> a gitlink that is a NewHash commit object name in it, therefore it
>> cannot refer to an unconverted SHA-1 submodule repository.  But it
>> is unclear if the above description refers to the same issue, or
>> something else.
>
> It refers to that issue.

We may want to find a way to make it clear, then.

>> It makes me wonder if we want to add the hashname in this object
>> header.  "length" would be different for non-blob objects anyway,
>> and it is not "compat metadata" we want to avoid baked in, yet it
>> would help diagnose a mistake of attempting to use a "mixed" objects
>> in a single repository.  Not a big issue, though.
>
> Do you mean that adding the hashname into the computation that
> produces the object name would help in some use case?

What I mean is that for SHA-1 objects we keep the object header to
be "<type> <length> NUL".  For objects in the newer world, use the
object header "<type> <hashname> <length> NUL", and include the
hashname in the object name computation.
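
That proposal as a sketch, with sha256 standing in for NewHash and the
exact spelling of the header obviously up for discussion:

    import hashlib

    def newhash_object_name(obj_type, content):
        # "<type> <hashname> <length> NUL" instead of the SHA-1 world's
        # "<type> <length> NUL"; the hashname takes part in the name.
        header = b"%s newhash %d\0" % (obj_type, len(content))
        return hashlib.sha256(header + content).hexdigest()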

> For loose objects, it would be nice to name the hash in the file, so
> that "file" can understand what is happening if someone accidentally
> mixes types using "cp".  The only downside is losing the ability to
> copy blobs (which have the same content despite being named using
> different hashes) between repositories after determining their new
> names.  That doesn't seem like a strong downside --- it's pretty
> harmless to include the hash type in loose object files, too.  I think
> I would prefer this to be a "magic number" instead of part of the
> zlib-deflated payload, since this way "file" can discover it more
> easily.

Yeah, thanks for doing pros-and-cons for me ;-)

>> If it is a goal to eventually be able to lose SHA-1 compatibility
>> metadata from the objects, then we might want to remove SHA-1 based
>> signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
>> commit object) from NewHash contents, and instead have them stored
>> in a side "metadata" table, only to be used while converting back.
>> I dunno if that is desirable.
>
> I don't consider that desirable.

Agreed.  Let's not go there.

>> Hmm, as the corresponding packfile stores object data only in
>> NewHash content format, it is somewhat curious that this table that
>> stores CRC32 of the data appears in the "Tables for each object
>> format" section, as they would be identical, no?  Unless I am
> grossly misreading the spec, the checksum should either go outside
>> the "Tables for each object format" section but still in .idx, or
>> should be eliminated and become part of the packdata stream instead,
>> perhaps?
>
> It's actually only present for the first object format.  Will find a
> better way to describe this.

I see.  One way to do so is to have it upfront before the "after
this point, these tables repeat for each of the hashes" part of the
file.

>> Oy.  So we can go from a short prefix to the pack location by first
>> finding it via binsearch in the short-name table, realize that it is
>> nth object in the object name order, and consulting this table.
>> When we know the pack-order of an object, there is no direct way to
>> go to its location (short of reversing the name-order-to-pack-order
>> table)?
>
> An earlier version of the design also had a pack-order-to-pack-offset
> table, but we weren't able to think of any cases where that would be
> used without also looking up the object name that can be used to
> verify the integrity of the inflated object.

The primary thing I was interested in knowing was if we tried to
think of any case where it may be useful and then didn't think of
any---I couldn't but I know I am not imaginative enough, and I
wanted to know you guys didn't, either.


Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-09-29 Thread Jonathan Nieder
Junio C Hamano wrote:
> Jonathan Nieder  writes:

>> This document describes what a transition to a new hash function for
>> Git would look like.  Add it to Documentation/technical/ as the plan
>> of record so that future changes can be recorded as patches.
>>
>> Also-by: Brandon Williams 
>> Also-by: Jonathan Tan 
>> Also-by: Stefan Beller 
>> Signed-off-by: Jonathan Nieder 
>> ---
>
> Shouldn't these all be s-o-b: (with a note immediately before that
> to say all four contributed equally or something)?

I don't want to get lost in the weeds in the question of how to
represent such a collaborative effort in git's metadata.

You're right that I should collect their sign-offs!  Your approach of
using text instead of machine-readable data for common authorship also
seems okay.

In any event, this is indeed

Signed-off-by: Brandon Williams 
Signed-off-by: Jonathan Tan 
Signed-off-by: Stefan Beller 

(I just checked :)).

>> +Background
>> +--
>> +At its core, the Git version control system is a content addressable
>> +filesystem. It uses the SHA-1 hash function to name content. For
>> +example, files, directories, and revisions are referred to by hash
>> +values unlike in other traditional version control systems where files
>> +or versions are referred to via sequential numbers. The use of a hash
>
> Traditional systems refer to files via numbers???  Perhaps "where
> versions of files are referred to via sequential numbers" or
> something?

Good point.  The wording you suggested will work well.

>> +function to address its content delivers a few advantages:
>> +
>> +* Integrity checking is easy. Bit flips, for example, are easily
>> +  detected, as the hash of corrupted content does not match its name.
>> +* Lookup of objects is fast.
>
> * There is no ambiguity what the object's name should be, given its
>   content.
>
> * Deduping the same content copied across versions and paths is
>   automatic.

:)  Yep, these are nice too, especially that second one.

It also is how we make diff-ing fast.

>> +SHA-1 still possesses the other properties such as fast object lookup
>> +and safe error checking, but other hash functions are equally suitable
>> +that are believed to be cryptographically secure.
>
> s/secure/more &/, perhaps?

We were looking for a phrase meaning that it should be a cryptographic
hash function in good standing, which SHA-1 is at least approaching
not being.

"more secure" should work fine.  Let's go with that.

>> +Goals
>> +-
>> +...
>> +   c. Users can use SHA-1 and NewHash identifiers for objects
>> +  interchangeably (see "Object names on the command line", below).
>
> Mental note.  This needs to extend to the "index X..Y" lines in the
> patch output, which is used by "apply -3" and "am -3".

Will add a note about this to "Object names on the command line".  Stefan
had already pointed out that that section should really be renamed to
something like "Object names in input and output".

>> +2. Allow a complete transition away from SHA-1.
>> +   a. Local metadata for SHA-1 compatibility can be removed from a
>> +  repository if compatibility with SHA-1 is no longer needed.
>
> I like the emphasis on "Local" here.  Metadata for compatibility that
> is embedded in the objects obviously cannot be removed.
>
> From that point of view, one of the goals ought to be "make sure
> that as much SHA-1 compatibility metadata as possible is local and
> outside the object".  This goal may not be able to say more than "as
> much as possible", as signed objects that came from SHA-1 world
> needs to carry the compatibility metadata somewhere somehow.
>
> Or perhaps we could.  There is nothing that says a signed tag
> created in the SHA-1 world must have the PGP/SHA-1 signature in the
> NewHash payload---it could be split off of the object data and
> stored in a local metadata cache, to be used only when we need to
> convert it back to the SHA-1 world.

That would break round-tripping and would mean that multiple SHA-1
objects could have the same NewHash name.  In other words, from
my point of view there is something that says that such data must
be preserved.

Another way to put it: even after removing all SHA-1 compatibility
metadata, one nice feature of this design is that it can be recovered
if I change my mind, from data in the NewHash based repository alone.

[...]
>> +Non-Goals
>> +-
>> ...
>> +6. Skip fetching some submodules of a project into a NewHash
>> +   repository. (This also depends on NewHash support in Git
>> +   protocol.)
>
> It is unclear what this means.  Around submodule support, one thing
> I can think of is that a NewHash tree in a superproject would record
> a gitlink that is a NewHash commit object name in it, therefore it
> cannot refer to an unconverted SHA-1 submodule repository.  But 

Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-09-29 Thread Junio C Hamano
Junio C Hamano  writes:

> Or perhaps we could.  There is nothing that says a signed tag
> created in the SHA-1 world must have the PGP/SHA-1 signature in the
> NewHash payload---it could be split off of the object data and
> stored in a local metadata cache, to be used only when we need to
> convert it back to the SHA-1 world.
> ...
>> +The format allows round-trip conversion between newhash-content and
>> +sha1-content.
>
> If it is a goal to eventually be able to lose SHA-1 compatibility
> metadata from the objects, then we might want to remove SHA-1 based
> signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
> commit object) from NewHash contents, and instead have them stored
> in a side "metadata" table, only to be used while converting back.
> I dunno if that is desirable.

Let's keep it simple by ignoring all of the above.  Even though
leaving the sha1-gpgsig and other crufts would etch these
compatibility metadata in objects forever, these remain only in
objects that originate from SHA-1 world, or in objects created in
the NewHash world only while the project participants still care
about SHA-1 compatibility.  Strictly speaking, it would be super
nice if we could do without contaminating these newly created objects
with SHA-1 compatibility headers, just like we wish to be able to
drop the SHA-1 vs NewHash mapping table after project participants
stop caring about SHA-1 compatibility, but it may not be worth it.  Of
course, if we decide to spend a bit more brain cycle to design how
we push these out of the object proper, the same solution would
automatically allow us to omit SHA-1 compatibility headers from the
objects that were converted from SHA-1 world.
>
>> +  - A table of 4-byte CRC32 values of the packed object data, in the
>> +order that the objects appear in the pack file. This is to allow
>> +compressed data to be copied directly from pack to pack during
>> +repacking without undetected data corruption.
>
> An obvious alternative would be to have the CRC32 checksum near
> (e.g. immediately before) the object data in the packfile (as
> opposed to the .idx file like this document specifies).  I am not
> sure what the pros and cons are between the two, though, and that is
> why I mention the possiblity here.
>
> Hmm, as the corresponding packfile stores object data only in
> NewHash content format, it is somewhat curious that this table that
> stores CRC32 of the data appears in the "Tables for each object
> format" section, as they would be identical, no?  Unless I am
> grossly misreading the spec, the checksum should either go outside
> the "Tables for each object format" section but still in .idx, or
> should be eliminated and become part of the packdata stream instead,
> perhaps?

Thinking about this a bit more, I think a single table per .idx file
would be the right way to go, not a checksum immediately after or
before the object data that is embedded in the pack stream.  In the
NewHash world (after this initial migration), we would want to be
able to stream NewHash packstream that comes from the network
straight to disk, which would mean these in-line CRC32 data would
need to be sent over the wire (i.e. 4-byte per object sent); that is
an unneeded overhead, as the packstream has its trailing checksum to
protect the whole thing anyway.


Re: [PATCH v4] technical doc: add a design doc for hash function transition

2017-09-29 Thread Junio C Hamano
Jonathan Nieder  writes:

> This document describes what a transition to a new hash function for
> Git would look like.  Add it to Documentation/technical/ as the plan
> of record so that future changes can be recorded as patches.
>
> Also-by: Brandon Williams 
> Also-by: Jonathan Tan 
> Also-by: Stefan Beller 
> Signed-off-by: Jonathan Nieder 
> ---

Shouldn't these all be s-o-b: (with a note immediately before that
to say all four contributed equally or something)?

> +Background
> +--
> +At its core, the Git version control system is a content addressable
> +filesystem. It uses the SHA-1 hash function to name content. For
> +example, files, directories, and revisions are referred to by hash
> +values unlike in other traditional version control systems where files
> +or versions are referred to via sequential numbers. The use of a hash

Traditional systems refer to files via numbers???  Perhaps "where
versions of files are referred to via sequential numbers" or
something?

> +function to address its content delivers a few advantages:
> +
> +* Integrity checking is easy. Bit flips, for example, are easily
> +  detected, as the hash of corrupted content does not match its name.
> +* Lookup of objects is fast.

* There is no ambiguity what the object's name should be, given its
  content.

* Deduping the same content copied across versions and paths is
  automatic.

> +SHA-1 still possesses the other properties such as fast object lookup
> +and safe error checking, but other hash functions are equally suitable
> +that are believed to be cryptographically secure.

s/secure/more &/, perhaps?

> +Goals
> +-
> +...
> +   c. Users can use SHA-1 and NewHash identifiers for objects
> +  interchangeably (see "Object names on the command line", below).

Mental note.  This needs to extend to the "index X..Y" lines in the
patch output, which is used by "apply -3" and "am -3".

> +2. Allow a complete transition away from SHA-1.
> +   a. Local metadata for SHA-1 compatibility can be removed from a
> +  repository if compatibility with SHA-1 is no longer needed.

I like the emphasis on "Local" here.  Metadata for compatibility that
is embedded in the objects obviously cannot be removed.

From that point of view, one of the goals ought to be "make sure
that as much SHA-1 compatibility metadata as possible is local and
outside the object".  This goal may not be able to say more than "as
much as possible", as signed objects that came from SHA-1 world
needs to carry the compatibility metadata somewhere somehow.  

Or perhaps we could.  There is nothing that says a signed tag
created in the SHA-1 world must have the PGP/SHA-1 signature in the
NewHash payload---it could be split off of the object data and
stored in a local metadata cache, to be used only when we need to
convert it back to the SHA-1 world.

But I am getting ahead of myself before reading the proposal
through.

> +Non-Goals
> +-
> ...
> +6. Skip fetching some submodules of a project into a NewHash
> +   repository. (This also depends on NewHash support in Git
> +   protocol.)

It is unclear what this means.  Around submodule support, one thing
I can think of is that a NewHash tree in a superproject would record
a gitlink that is a NewHash commit object name in it, therefore it
cannot refer to an unconverted SHA-1 submodule repository.  But it
is unclear if the above description refers to the same issue, or
something else.

> +Overview
> +
> +We introduce a new repository format extension. Repositories with this
> +extension enabled use NewHash instead of SHA-1 to name their objects.
> +This affects both object names and object content --- both the names
> +of objects and all references to other objects within an object are
> +switched to the new hash function.
> +
> +NewHash repositories cannot be read by older versions of Git.
> +
> +Alongside the packfile, a NewHash repository stores a bidirectional
> +mapping between NewHash and SHA-1 object names. The mapping is generated
> +locally and can be verified using "git fsck". Object lookups use this
> +mapping to allow naming objects using either their SHA-1 and NewHash names
> +interchangeably.
> +
> +"git cat-file" and "git hash-object" gain options to display an object
> +in its sha1 form and write an object given its sha1 form.

Both of these are somewhat unclear.  I am guessing that "git
cat-file --convert-to=sha1 <type> <object>" would emit the
object contents converted from their NewHash payload to SHA-1
payload (blobs are unchanged, trees, commits and tags get their
outgoing references converted from NewHash to their SHA-1
counterparts), and that is what you mean by "options to display an
object in its sha1 form".  

I am not sure how "git hash-object" with the option would work,
though.  Do you give an option "--hash=sha1 --stdout --stdin -t
" to feed a NewHash contents (file,