Re: [PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

Gregory Szorc Wed, 01 Mar 2017 17:39:35 -0800

On Wed, Mar 1, 2017 at 4:46 PM, Mike Hommey <m...@glandium.org> wrote:


> On Wed, Mar 01, 2017 at 04:34:43PM -0800, Gregory Szorc wrote:
> > On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori <
> fo...@lares.dti.ne.jp>
> > wrote:
> >
> > > # HG changeset patch
> > > # User FUJIWARA Katsunori <fo...@lares.dti.ne.jp>
> > > # Date 1488380487 -32400
> > > #      Thu Mar 02 00:01:27 2017 +0900
> > > # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> > > # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> > > similar: allow similarity detection to use sha256 for digesting file
> > > contents
> > >
> > > Before this patch, similarity detection logic (used for addremove and
> > > automv) uses SHA-1 digesting. But this cause incorrect rename
> > > detection, if:
> > >
> > >   - removing file A and adding file B occur at same committing, and
> > >   - SHA-1 hash values of file A and B are same
> > >
> > > This may prevent security experts from managing sample files for
> > > SHAttered issue in Mercurial repository, for example.
> > >
> > >   https://security.googleblog.com/2017/02/announcing-first-
> > > sha1-collision.html
> > >   https://shattered.it/
> > >
> > > Hash collision itself isn't so serious for core repository
> > > functionality of Mercurial, described by mpm as below, though.
> > >
> > >   https://www.mercurial-scm.org/wiki/mpm/SHA1
> > >
> > > HOW ABOUT:
> > >
> > >   - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
> > >
> >
> > SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's
> > likely no good reason to use SHA-256 for simple identity checks.
> >
> >
> > >
> > >     ease (handling problematic files safely by default) or
> > >     performance?
> > >
> >
> >
> > On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and
> > SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time
> content
> > identity checks, hashing shouldn't be a bottleneck, at least not for most
> > repos.
> >
> > So I think it is fine to change this function from SHA-1 to SHA-512
> > assuming the hashes don't "leak" into storage. If they end up being
> stored
> > or used for something other than identity checks, then we need to bloat
> > scope to discuss our general hashing future. And that needs its own
> thread
> > ;)
>
> With hashing, there is *always* the risk of collision. It might be tiny,
> but it still exists. Why not just compare the contents when the hash
> match? Then it doesn't really matter what the hash is. The hash is just
> a shortcut to avoid comparing full contents in a O(n^2) fashion.
>
> There aren't going to be that many hash matches anyways, comparing the
> content then should not make a significant difference in speed, but
> would guarantee that the "similarity" is real.
>

Yeah, the objects will already be in memory. Under the hood bytes.__eq__
just looks at the buffer length then falls back to memcmp, which should be
on the order of 10 GB/s. So using __eq__ here seems reasonable.


>
> (BTW, interestingly, in terms of similarity detection, while the
> SHAttered PDFs are not 100% identical, they are 80%+ similar)
>
> Mike
>

_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

Re: [PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

Reply via email to