Hi,

Thanks for the tips and pointers. As for getting off-topic, I'm the one to blame anyway. So I'm going to strip most of the message and comment on the points that might still be of public interest.
>> (*) BTW, did you try existing [multi-block SHA]?
>
> No, totally missed it! Found it now, good work!
>
> $ find -name 'sha*-mb*'
> ./crypto/sha/asm/sha256-mb-x86_64.pl
> ./crypto/sha/asm/sha1-mb-x86_64.pl
>
> How is an application using OpenSSL supposed to access this
> functionality? Is there documentation? So far, I only found uses in
> OpenSSL's own e_aes_cbc_hmac_sha*.c and no export of these symbols.

Well, you have to admit that it's a bit too special to provide a general-purpose interface to. Which is why an application-specific interface is provided instead, the TLS-oriented one in e_aes_cbc_hmac_sha*.c. The mention of multi-block SHA was not really a "go ahead and use it" kind of remark, but rather an "is it interesting?" with an implied "if it is interesting, then we can discuss how to interface your application to it". Note that it's even possible to take those modules out of the OpenSSL context...

> You could want to add optional use of XOP there - rotates and vcmov.
> For SHA-1, F() is just one vcmov and H() is vcmov/andnot/xor (see
> sse-intrinsics.c above). For SHA-2, we use:
>
> #define Maj(x,y,z) vcmov(x, y, vxor(z, y))
> #define Ch(x,y,z) vcmov(y, z, x)

As for XOP: the motto is to provide near-optimal performance with minimal code. That means that if some processor-specific optimization provides only a small improvement, it's likely to be omitted. I don't recall attempting XOP specifically in multi-block SHA256, but it was attempted in SHA1 and wasn't impressive. I even recall XOP rotates delivering worse performance in some case. It was likely an instruction alignment issue (at least I ran into an anomaly with the ChaCha code where merely flipping the order of an instruction's input arguments affected performance). Another case of XOP omission is plain SHA256. The point there is that execution is dominated by the scalar part, so reducing the number of vector instructions has no effect whatsoever. Anyway, XOP is considered, but so far has not been found "worthy".
But it makes sense to double-check multi-block SHA256 specifically...

> We're also experimenting with instruction interleaving. Sometimes,
> especially when running only 1 thread/core (such as on cheaper Intel
> CPUs without HT, or when there's no thread-level parallelism in the
> application - not our case, though), it's optimal to interleave several
> SIMD computations, for even wider virtual SIMD vectors than the CPU
> supports natively. e.g. for MD5 on AVX (64-bit builds only, since need
> 16 registers for interleaving), we currently interleave 3 of those (so
> 12 MD5's in parallel per thread).

It's not uncommon for cryptographic algorithms to have short dependency chains and consequently limited ILP, instruction-level parallelism. But processors have limited resources too, and the question is whether those resources are sufficient to sustain the algorithm's ILP. Or rather vice versa: if the processor has more resources than the ILP can use, those resources will run underutilized, and only then does it make sense to interleave instructions. Processor resources can be characterized by an IPC (instructions per cycle) limit, and the maximum possible improvement from interleaving is IPC/ILP. But one should remember that IPC is not just the number of execution ports, for example 4 on Haswell. Some instructions are port-specific, and if an algorithm uses such instructions a lot, you'll be limited by that port.

Anyway, MD5 is known for its low ILP and it does make sense to interleave it (with itself or another algorithm). This doesn't apply to SHA: it has higher ILP, and no contemporary processor has the capacity to fully utilize that parallelism. Actually, it's a bit worse in practice, because the thing about multi-block is that it's limited by shifts, which are port-specific. This is why you observe virtually no difference among "desktop/server" processors. As for the 4 Haswell ports: of the 4, only 3 can execute vector instructions.
So the absolute best results can be achieved when you mix scalar integer-only and vector instructions, e.g. in addition to MD5 on AVX, mix in even a scalar "thread". Well, the gain would have to be divided by the ratio between how many blocks the vector part processes and how many blocks the scalar part adds, so the gain would be too little to care about. It's more of a fun fact in this context.

_______________________________________________
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev
