On Sun, Feb 26, 2017 at 10:38:35PM +0100, Ævar Arnfjörð Bjarmason wrote:

> On Sun, Feb 26, 2017 at 8:11 PM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> > But yes, SHA3-256 looks like the sane choice. Performance of hashing
> > is important in the sense that it shouldn't _suck_, but is largely
> > secondary. All my profiles on real loads (well, *my* real loads) have
> > shown that zlib performance is actually much more important than SHA1.
> 
> What's the zlib vs. hash ratio on those profiles? If git is switching
> to another hashing function, then given the developments in faster
> compression algorithms (gzip vs. snappy vs. zstd vs. lz4)[1] we'll
> probably switch to another compression algorithm sooner rather than later.
> 
> Would compression still be the bottleneck by far with zstd, and how about
> with lz4?
> 
> 1. https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/

zstd does help in normal operations that access lots of blobs. Here are
some timings:

  http://public-inbox.org/git/20161023080552.lma2v6zxmyaii...@sigill.intra.peff.net/

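To get a rough local feel for that gap, here's the kind of toy
comparison I mean (illustrative only: it assumes the python-zstandard
bindings, and the blob sizes and counts are made up, not taken from the
timings above):

  # Rough comparison of zlib vs zstd decompression on many small,
  # blob-sized payloads (assumes "pip install zstandard").
  import os, time, zlib
  import zstandard

  blobs = [os.urandom(200) + b"x" * 3800 for _ in range(10000)]  # ~4KB each
  z_data = [zlib.compress(b) for b in blobs]
  zs_data = [zstandard.ZstdCompressor().compress(b) for b in blobs]
  dctx = zstandard.ZstdDecompressor()

  start = time.perf_counter()
  for b in z_data:
      zlib.decompress(b)
  print("zlib decompress: %.3fs" % (time.perf_counter() - start))

  start = time.perf_counter()
  for b in zs_data:
      dctx.decompress(b)
  print("zstd decompress: %.3fs" % (time.perf_counter() - start))
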
Compression is part of the on-the-wire packfile format, so it introduces
compatibility headaches. Unlike the hash, it _can_ be a local thing
negotiated between the two ends, and a server with zstd data could
convert on the fly to zlib. You just wouldn't want to do so on a server
because it's really expensive (or you double your cache footprint to
store both).

If there were a hash flag day, we _could_ make sure all post-flag-day
implementations have zstd, and just start using that (it transparently
handles old zlib data, too). I'm just hesitant to throw in the kitchen
sink and make the hash transition harder than it already is.

Hash performance doesn't matter much for normal read operations. If your
implementation is really _slow_, it does matter for a few operations
(notably index-pack receiving a large push or fetch). Some timings:

  http://public-inbox.org/git/20170223230621.43anex65ndoqb...@sigill.intra.peff.net/

If the new algorithm is faster than SHA-1, that might be measurable in
those operations, too, but the effect would obviously be less dramatic,
since hashing is just a fraction of the total operation (so a slow hash
can balloon the time, but optimizing it can only save so much).

I don't know if the per-hash setup cost of any of the new algorithms is
higher than SHA-1's. We care as much about hashing lots of small content
as we do about sustained throughput of a single hash.
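
The sort of micro-benchmark that would show it is something like this
(sizes and counts are arbitrary, and sha3_256 just stands in for
whichever candidate we end up with):

  # Hash many small buffers vs one large buffer of the same total size,
  # to expose per-hash setup cost (sizes are arbitrary).
  import hashlib, os, time

  small = [os.urandom(256) for _ in range(100000)]  # tree/commit-ish sizes
  big = os.urandom(256 * 100000)                    # same total bytes

  for algo in ("sha1", "sha3_256"):
      start = time.perf_counter()
      for b in small:
          hashlib.new(algo, b).digest()
      t_small = time.perf_counter() - start

      start = time.perf_counter()
      hashlib.new(algo, big).digest()
      t_big = time.perf_counter() - start

      print("%-9s 100k small: %.3fs   one big: %.3fs" % (algo, t_small, t_big))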

-Peff
