On Fri, Mar 03, 2017 at 05:12:51PM -0800, Jonathan Nieder wrote:

> This past week we came up with this idea for what a transition to a new
> hash function for Git would look like.  I'd be interested in your
> thoughts (especially if you can make them as comments on the document,
> which makes it easier to address them and update the document).

Overall it's an interesting idea. I thought at first that you were
suggesting servers do on-the-fly conversion, but after a more careful
reading that isn't the case. I don't think on-the-fly conversion would
work anyway, because the conversion is expensive.

So this pushes the conversion cost onto the clients who decide to move
to SHA-256. That may be a problem for sites which have a lot of clients
(like CI hosts). But I guess they would just stick with SHA-1 as long as
possible, until the upstream repo switches (and that _is_ a per-repo
flag day, because the upstream host isn't going to convert back to SHA-1
on the fly to serve the old clients).

> You can use the doc URL
> 
>  https://goo.gl/gh2Mzc

I'd encourage anybody following along to follow that link. I almost
didn't, but there are a ton of comments there (I'm not sure how I feel
about splitting the discussion off the list, though).

> Goals
> -----
> 1. The transition to SHA256 can be done one local repository at a time.
>    a. Requiring no action by any other party.
>    b. A SHA256 repository can communicate with SHA-1 Git servers and
>       clients (push/fetch).
>    c. Users can use SHA-1 and SHA256 identifiers for objects
>       interchangeably.
>    d. New signed objects make use of a stronger hash function than
>       SHA-1 for their security guarantees.
> 2. Allow a complete transition away from SHA-1.
>    a. Local metadata for SHA-1 compatibility can be dropped in a
>       repository if compatibility with SHA-1 is no longer needed.

I suspect we'll never get away from keeping the mapping table. You'll
need at least the sha1->sha256 table if you want to look up names found
in historic commit messages, mailing list posts, etc.

And you'll need the sha256->sha1 table if you want to verify the gpg
signatures on old tags and commits. That might be something people are
willing to drop, though.
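
To make the two directions concrete, here is a minimal sketch in Python.
The MappingTable class and its method names are hypothetical, not
anything from the proposal, and it handles blobs only: a blob's content
embeds no other object names, so the same bytes can be hashed under
both functions. Trees and commits would need their content rewritten
first.

```python
import hashlib

def object_name(algo, obj_type, content):
    # Git names an object by hashing "<type> <size>\0" followed by the content.
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.new(algo, header + content).hexdigest()

class MappingTable:
    """Hypothetical in-memory stand-in for the translation table."""

    def __init__(self):
        self.sha1_to_sha256 = {}  # look up names found in old messages
        self.sha256_to_sha1 = {}  # recover sha1 names to verify old signatures

    def add_blob(self, content):
        sha1 = object_name("sha1", "blob", content)
        sha256 = object_name("sha256", "blob", content)
        self.sha1_to_sha256[sha1] = sha256
        self.sha256_to_sha1[sha256] = sha1
        return sha1, sha256

    def drop_sha1_verification(self):
        # Giving up verification of old signatures lets this direction go.
        self.sha256_to_sha1.clear()

table = MappingTable()
old_name, new_name = table.add_blob(b"hello\n")
```

Dropping the sha256->sha1 direction leaves lookups of historic sha1
names working, which matches the asymmetry described above.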

> After negotiation, the server sends a packfile containing the
> requested objects. We convert the packfile to SHA-256 format using the
> following steps:
> 
> 1. index-pack: inflate each object in the packfile and compute its
>    SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
>    objects the client has locally. These objects can be looked up using
>    the translation table and their sha1-content read as described above
>    to resolve the deltas.
> 2. topological sort: starting at the "want"s from the negotiation
>    phase, walk through objects in the pack and emit a list of them in
>    topologically sorted order. (This list only contains objects
>    reachable from the "wants". If the pack from the server contained
>    additional extraneous objects, then they will be discarded.)

I don't think we do this right now, but you can actually find the entry
(and exit) points of a pack during the index-pack step. Basically:

  1. Keep a hashmap of objects mentioned in the pack.

  2. When we process an object's content (i.e., compute its hash), also
     parse it for any object references. Add entries in the hashmap for
     any object mentioned this way. Mark the entry for the object we
     processed with a "HAVE" bit, and mark any referenced object with a
     "REF" bit.

  3. After processing all objects, anything with a "HAVE" but no "REF"
     is an entry point to the pack (i.e., something that we should have
     asked for with a want). Anything with a "REF" but not a "HAVE" is
     an exit point (i.e., an object that we are expected to already have
     in our repo).

     (I've thought about this before because we could possibly shortcut
     the connectivity check using the exit points. It's complicated by
     the fact that we don't assume the transitive presence of objects
     unless they are reachable).
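
The bookkeeping in steps 1-3 can be sketched roughly as follows. This
is Python for brevity (index-pack itself is C), and the function name,
the input shape (a list of name/references pairs), and the toy object
names are all assumptions made for illustration:

```python
HAVE = 1  # the object's content was present in the pack
REF = 2   # the object was mentioned by another object in the pack

def pack_entry_exit_points(pack_objects):
    """pack_objects: iterable of (name, referenced_names), in any order."""
    flags = {}
    for name, refs in pack_objects:
        # We processed this object's content, so we have it.
        flags[name] = flags.get(name, 0) | HAVE
        for ref in refs:
            # Anything it mentions gets an entry, even if never seen in the pack.
            flags[ref] = flags.get(ref, 0) | REF
    entry = sorted(n for n, f in flags.items() if f == HAVE)
    exits = sorted(n for n, f in flags.items() if f == REF)
    return entry, exits

# Toy pack: a new commit and its tree. The parent commit and one
# unchanged blob are assumed to exist in the local repo already.
pack = [
    ("tree-2", ["blob-new", "blob-old"]),
    ("blob-new", []),
    ("commit-2", ["tree-2", "commit-1"]),
]
entry, exits = pack_entry_exit_points(pack)
print(entry)  # ['commit-2']
print(exits)  # ['blob-old', 'commit-1']
```

Objects can arrive in any order because the flags are only interpreted
after the whole pack has been processed.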

I don't think using the "want"s as the entry points is unreasonable,
though. The server _shouldn't_ generally be sending us other cruft.

I do wonder if you might be able to omit the extra object-graph walk
from your step 2, if you could assign "depths" to each object during
step 1 instead of HAVE/REF bits. The trouble, of course, is that you're
not visiting the nodes in the right order (so given two trees, you're
not sure if one might eventually be a child of the other; how do you
assign their depths?). I have a feeling there's a proof that it's
impossible, but I might just not be clever enough.


Overall the basics of the conversion seem sound to me. The "nohash"
thing seems more complicated than I think it ought to be, which
probably just means I'm missing something.  I left a few related
comments on the Google doc, so I won't repeat them here.

-Peff
