On 01.03.2017 13:52, Stefan Fuhrmann wrote:
> On 01.03.2017 05:17, Greg Stein wrote:
>> I really like this idea.
>>
>> And we could take a copy of APR's sha1 code, and rejigger it to
>> perform *both* hashes during the same scan of the raw bytes. I would
>> expect the time taken to extend by (say) 1.1X rather than a full 2X.
>> The inner loop might cost a bit more, but we'd only scan the bytes
>> once. Very handy when you're talking about megabytes in a stream-y
>> environment.
>>
>> (and medium-term, push this dual-sha1 computation back into APR)
>>
>> On Sun, Feb 26, 2017 at 10:08 AM, Garance A Drosehn
>> <dro...@rpi.edu> wrote:
>>
>> On 24 Feb 2017, at 15:46, Stefan Sperling wrote:
>> >
>> > I believe we should prepare a new working format for 1.10.0
>> > which addresses this problem. I don't see a good way of fixing
>> > it without a format bump. The bright side of this is that it
>> > gives us a good reason to get 1.10.0 ready ASAP.
>> >
>> > We can switch to a better hash algorithm with a WC format
>> > bump.
>>
>> One of the previous messages mentioned that better hash
>> algorithms are more expensive. So let me mention a tactic
>> that I used many years ago, when MD5 was the best digest
>> algorithm that I knew of, and I didn't trust it for the
>> larger files I was working with at the time:
>>
>> Instead of going with a completely different hash algorithm,
>> just double down on the one you're using. What I did was to
>> calculate one digest the standard way, and then a second one
>> which summed up every-other-byte (or every 3rd byte, or ...).
>> So to get a collision, not only do two files have to get the
>> same digest-result for all their data, but they also have to
>> get the same digest-result when exactly half the data is
>> skipped over.
>>
>> (I did this a long time ago, and forget the details. What
>> I may have done for performance reasons was every-other-word,
>> not every-other-byte.)
>>
>> My thinking was that *any* single algorithm which processes
>> all the data is going to get collisions, eventually. But it
>> will be much harder for someone to generate a duplicate file
>> where there will also be a collision when summing up only
>> half of the data.
>>
>> I'm not claiming this is a great cure-all solution, but just
>> that it's an alternate tactic which might be interesting.
>> People could create repositories with just the one digest,
>> or upgrade them to use multiple digests if they have the need.
>>
>> I found a few benchmarks which suggest that sha-256 is maybe
>> twice as expensive as sha-1, so calculating two sha-1 digests
>> might be a reasonable alternative.
>
> That is also known as bit-slicing. The neat thing is that
> you create N (e.g. 4) interleaved sub-streams whose
> checksums can be calculated *concurrently*, e.g. using
> SIMD instructions. So you end up being ~3 times *faster*
> than calculating the normal checksum.
>
> Because the interleaved streams may look quite similar
> (think bitmaps), you probably want to "salt" them. A simple
> rotate or XOR might do - but I'm not an expert on this.
> The goal is to end up with 4 reasonably independent
> streams, hence sub-hashes.
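To make the ideas quoted above a bit more concrete, here is a rough
sketch (not anyone's actual implementation, and assuming APR's
apr_sha1.h API) of computing two digests during a single scan of the
data: one SHA-1 over every byte, and a second over every other byte,
as Garance describes. The data is consumed in chunks, the way a
stream would deliver it.

    #include <apr.h>
    #include <apr_sha1.h>

    #define CHUNK 4096

    static void
    dual_sha1(const unsigned char *data, apr_size_t len,
              unsigned char full_digest[APR_SHA1_DIGESTSIZE],
              unsigned char half_digest[APR_SHA1_DIGESTSIZE])
    {
      apr_sha1_ctx_t full_ctx, half_ctx;
      unsigned char evens[CHUNK / 2 + 1];
      apr_size_t offset = 0;

      apr_sha1_init(&full_ctx);
      apr_sha1_init(&half_ctx);

      while (offset < len)
        {
          apr_size_t chunk = (len - offset < CHUNK) ? len - offset : CHUNK;
          apr_size_t i, n = 0;

          /* The full digest sees every byte of the chunk. */
          apr_sha1_update_binary(&full_ctx, data + offset,
                                 (unsigned int)chunk);

          /* The second digest sees only bytes at even stream offsets. */
          for (i = 0; i < chunk; i++)
            if ((offset + i) % 2 == 0)
              evens[n++] = data[offset + i];
          apr_sha1_update_binary(&half_ctx, evens, (unsigned int)n);

          offset += chunk;
        }

      apr_sha1_final(full_digest, &full_ctx);
      apr_sha1_final(half_digest, &half_ctx);
    }

Greg's suggestion would go one step further and fold both digests into
APR's inner loop itself, so the bytes are only touched once and the
cost stays closer to 1.1X than 2X.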
> So, the full sequence would look something like this:
>
> * Split text T into interleaved sub-streams T1,..,T4
> * Salt them: S1 = salt(T1, 1), ..., S4 = salt(T4, 4)
> * Calculate sub-stream hashes using bit-sliced code:
>   D1, ..., D4 = sha1_4x(S1, ..., S4) = sha1(S1), ..., sha1(S4)
> * Calculate the final checksum D = sha2(D1|...|D4)
>
> Not only would that solve the current sha1 issue, but it would
> also neatly address the fact that nowadays we can read data
> faster from disk than we can checksum it.
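For illustration only, a serial sketch of that sequence (a real
implementation would use bit-sliced/SIMD code to hash the four salted
sub-streams concurrently, which is where the "~3 times faster" figure
comes from). It assumes APR's apr_sha1.h for the sub-stream hashes and
OpenSSL's SHA256() for the final combining step; the XOR-with-slice-index
salt is just a placeholder, not a vetted scheme.

    #include <apr.h>
    #include <apr_sha1.h>
    #include <openssl/sha.h>

    #define NSLICES 4

    static void
    sliced_checksum(const unsigned char *data, apr_size_t len,
                    unsigned char digest[SHA256_DIGEST_LENGTH])
    {
      apr_sha1_ctx_t ctx[NSLICES];
      unsigned char sub[NSLICES][APR_SHA1_DIGESTSIZE];
      apr_size_t i;
      int s;

      for (s = 0; s < NSLICES; s++)
        apr_sha1_init(&ctx[s]);

      /* Split T into interleaved sub-streams T1..T4, salting each byte
         before it goes into its slice's hash. */
      for (i = 0; i < len; i++)
        {
          unsigned char salted = data[i] ^ (unsigned char)(i % NSLICES + 1);
          apr_sha1_update_binary(&ctx[i % NSLICES], &salted, 1);
        }

      /* D1, ..., D4 = sha1(S1), ..., sha1(S4) */
      for (s = 0; s < NSLICES; s++)
        apr_sha1_final(sub[s], &ctx[s]);

      /* D = sha2(D1|D2|D3|D4) */
      SHA256(&sub[0][0], sizeof(sub), digest);
    }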
For some definition of "solve" — i.e., until a more generic attack
method is invented. :)

-- Brane