Johannes Schindelin <johannes.schinde...@gmx.de> writes:

> - the most important part will be the patch turning core.enableSHA1DC
>   into a tristate: "externalOnly" or "smart" or "auto" or something
>   indicating that it switches on collision detection only for commands
>   that accept objects from an outside source into the local repository,
>   such as fetch, fast-import, etc

There are different uses of SHA-1 hashing in Git, and I do agree
that depending on the use, some of them do not need the overhead for
the collision-attack detection.  As DC_SHA1 with attack detection
disabled may still be slower than other implementations, it may make
sense to have a compile-time option to use DC_SHA1 for places that
needs protection, and another implementation [*1*] for places that
don't.

I think the places that MUST use DC variant is anything that hashes
content for a single object to compute its object name.  "git add"
[*2*] that creates a new loose object, "git index-pack" that reads
incoming pack stream over the wire, reconstitutes each object data
while resolving delta and hash them to get their names to construct
the .idx file and "git unpack-objects" that does the same but
explodes the pack contents into loose objects, "git write-tree" that
creates a new tree object given the contents of the index, etc.

One notable exception of the above is "update-index --refresh".  We
already have contents in the index and in the object store, and we
are hashing the contents in the working tree to see if it hashes to
the same value.  When the hash does not match, it won't go in to the
object store.  When the hash does match, it either is indeed the
same content (i.e. no collision), in which case we earlier must have
done the collision-attack detecting hash when we added the object to
the object store.  Or the object we have in the object store and
what is in the working tree are different contents that hash to the
same name, in which case the user already has colliding pair and it
is too late to invoke collision-attack detecting variant ;-)

The running checksum over the whole file csum-file.c computes does
not have to be the collision-attack detecting kind.  This is the
hash at the end of various files like the index, .pack .idx, etc.
These are used to protect us against bit-flipping disks and we are
not fighting with a clever disk that can do collision attack.  For
that matter, some of these checksums do not even have to be SHA-1.
If one hacks his own Git to replace SHA-1 checksum at the end of the
index file with something faster and weaker and use it in one's
repository, nobody else would notice nor care.  The same thing can
be said for the .idx file.  The one at the end of .pack does get
checked at the receiving end when it comes over the wire, so it MUST
be SHA-1, but it does not have to be hashed with collision-attack
detection.

The rerere database is indexed with SHA-1 of (normalized) conflict
text.  This does not have to be hashed with the collision-attack
detection logic.  Thanks to recent update that allows multiple pairs
of conflict and its resolution, the subsystem is prepared to see two
conflicts that share the same hash already (for completely different
reasons).

The hash that names a packfile is constructed by sorting all the
names of the objects contained in the packfile and running SHA-1
hash over it.  I think this MUST be hashed with collision-attack
detection.  A malicious site can feed you a packfile that contains
objects the site crafts so that the sorted object names would result
in a collision-attack, ending up with one pack that contains a sets
of objects different from another pack that happens to have the same
packname, causing Git to say "Ah, this new pack must have the same
set of objects as the pack we already have" and discard it,
resulting in lost objects and a corrupt repository with missing
objects.


[Footnote]

*1* I think this PREVIEW hardcodes OpenSSL only for illustration and
    that is OK for a preview.  Given the recent news on licensing,
    however, if we want to pursue this dual hashing scheme, we must
    consider allowing other implementations as well in the final
    form.

*2* In this paragraph, whenever "git" command is named, I mean both
    the command and its underlying machinery.  When I say "git
    write-tree", for example, write_index_as_tree() obviously is
    included.

Reply via email to