On Wed, Mar 1, 2017 at 4:46 PM, Mike Hommey <m...@glandium.org> wrote:
> On Wed, Mar 01, 2017 at 04:34:43PM -0800, Gregory Szorc wrote: > > On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori < > fo...@lares.dti.ne.jp> > > wrote: > > > > > # HG changeset patch > > > # User FUJIWARA Katsunori <fo...@lares.dti.ne.jp> > > > # Date 1488380487 -32400 > > > # Thu Mar 02 00:01:27 2017 +0900 > > > # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c > > > # Parent 0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa > > > similar: allow similarity detection to use sha256 for digesting file > > > contents > > > > > > Before this patch, similarity detection logic (used for addremove and > > > automv) uses SHA-1 digesting. But this cause incorrect rename > > > detection, if: > > > > > > - removing file A and adding file B occur at same committing, and > > > - SHA-1 hash values of file A and B are same > > > > > > This may prevent security experts from managing sample files for > > > SHAttered issue in Mercurial repository, for example. > > > > > > https://security.googleblog.com/2017/02/announcing-first- > > > sha1-collision.html > > > https://shattered.it/ > > > > > > Hash collision itself isn't so serious for core repository > > > functionality of Mercurial, described by mpm as below, though. > > > > > > https://www.mercurial-scm.org/wiki/mpm/SHA1 > > > > > > HOW ABOUT: > > > > > > - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ? > > > > > > > SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's > > likely no good reason to use SHA-256 for simple identity checks. > > > > > > > > > > ease (handling problematic files safely by default) or > > > performance? > > > > > > > > > On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and > > SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time > content > > identity checks, hashing shouldn't be a bottleneck, at least not for most > > repos. > > > > So I think it is fine to change this function from SHA-1 to SHA-512 > > assuming the hashes don't "leak" into storage. If they end up being > stored > > or used for something other than identity checks, then we need to bloat > > scope to discuss our general hashing future. And that needs its own > thread > > ;) > > With hashing, there is *always* the risk of collision. It might be tiny, > but it still exists. Why not just compare the contents when the hash > match? Then it doesn't really matter what the hash is. The hash is just > a shortcut to avoid comparing full contents in a O(n^2) fashion. > > There aren't going to be that many hash matches anyways, comparing the > content then should not make a significant difference in speed, but > would guarantee that the "similarity" is real. > Yeah, the objects will already be in memory. Under the hood bytes.__eq__ just looks at the buffer length then falls back to memcmp, which should be on the order of 10 GB/s. So using __eq__ here seems reasonable. > > (BTW, interestingly, in terms of similarity detection, while the > SHAttered PDFs are not 100% identical, they are 80%+ similar) > > Mike >
_______________________________________________ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel