Re: SHA1 collisions found

Junio C Hamano Fri, 24 Feb 2017 22:19:57 -0800

Linus Torvalds <torva...@linux-foundation.org> writes:

> For example, what I would suggest the rules be is something like this:
>
>  - introduce new tag2/commit2/tree2/blob2 object type tags that imply
> that they were hashed using the new hash
>
>  - an old type obviously can never contain a pointer to a new type (ie
> you can't have a "tree" object that contains a tree2 object or a blob2
> object.
>
>  - but also make the rule that a *new* type can never contain a
> pointer to an old type, with the *very* specific exception that a
> commit2 can have a parent that is of type "commit".


OK, I think that is what Peff was suggesting in his message, and I
do not have problem with such a transition plan.  Or the *very*
specific exception could be that a reference to "commit" can use old
name (which would allow binding a submodule before transition to a
new project).

We probably do not need "blob2" object as they do not embed any
pointer to another thing.  A loose blob with old name can be made
available on the filesystem also under new name without much "heavy"
transition, and an in-pack blob can be pointed at with _two_ entries
in the updated pack index file under old and new names, both for the
base (just deflated) representation and also ofs-delta.  A ref-delta
based on another blob with old name may need a bit of special
handling, but the deltification would not be visible at the "struct object"
layer, so probably not such a big deal.

We may also be able to get away without "commit2" and "tag2" as
their pointers can be widened and parse_{commit,tag}_object() should
be able to deal with objects with new names transparently.  "tree2"
may be a bit tricky, though, but offhand it seems to me that nothing
is insurmountable.

> That way everything "converges" towards the new format: the only way
> you can stay on the old format is if you only have old-format objects,
> and once you have a new-format object all your objects are going to be
> new format - except for the history.

Yes.

> So you will end up with duplicate objects, and that's not good (think
> of what it does to all our full-tree "diff" optimizations, for example
> - you no longer get the "these sub-trees are identical" across a
> format change), but realistically you'll have a very limited time of
> that kind of duplication.
>
> I'd furthermore suggest that from a UI standpoint, we'd
>
>  - convert to 64-character hex numbers (32-byte hashes)
>
>  - (as mentioned earlier) default to a 40-character abbreviation
>
>  - make the old 40-character SHA1's just show up within the same
> address space (so they'd also be encoded as 32-byte hashes, just with
> the last 12 bytes zero).

Yes to all of the above.

>  - you'd see in the "object->type" whether it's a new or old-style hash.

I am not sure if this is needed.  We may need to abstract tree_entry walker
a little bit as a preparatory step, but I suspect that the hash (and
more importantly the internal format) can be kept as an internal
knowledge to the object layer (i.e. {commit,tree,tag}.c).

So,... thanks for straightening me out.  I was thinking we would
need mixed mode support for smoother transition, but it now seems to
me that the approach to stratify the history into old and new is
workable.

Re: SHA1 collisions found

Reply via email to