On Sun, Apr 12, 2020 at 10:20 AM Andy Polyakov via cfarm-users
<[email protected]> wrote:
>
> >>> GCC135 is a Power9 machine. Benchmarking on the machine shows
> >>> performance is off. For example, here are some numbers for AES in ECB
> >>> mode:
> >>>
> >>> GCC112 (Linux, ppc64le, 3.7 GHz, GCC 8.2):
> >>> * 1.12 cpb, 2851 MB/s
> >>>
> >>> GCC119 (AIX, ppc64be, 4.1 GHz, GCC 8.2):
> >>> * 0.54 cpb, 7242 MB/s
>
> Just in case, AIX is misreporting resource usage (getrusage), and
> presented result appears to be aligned with the said misrepresentation.
> I mean if you try to calculate performance based on the skewed getrusage
> values, you'll customarily observe ~2 "better" cycles-per-byte results.
> In essence you should either use wallclock or disregard AIX results. But
> don't ask me why AIX does it, as I don't know...
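To make the wall-clock suggestion concrete, here is a minimal sketch of a cycles-per-byte measurement based on elapsed time rather than getrusage. It is not the Crypto++ or OpenSSL benchmark code. The helper names (wall_seconds, cycles_per_byte, measure_cpb), the XOR loop standing in for the cipher, and the use of clock_gettime(CLOCK_MONOTONIC) as the time source are all assumptions made for illustration, and the clock rate has to be supplied by the caller.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime on strict compilers */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Elapsed wall-clock seconds from a monotonic clock (not CPU time). */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* cycles per byte = (clock rate in Hz * elapsed seconds) / bytes processed */
static double cycles_per_byte(double cpu_hz, double seconds, double bytes)
{
    assert(bytes > 0.0);
    return cpu_hz * seconds / bytes;
}

/* Hypothetical stand-in for the cipher under test: XOR a buffer in place. */
static void dummy_kernel(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= 0x5a;
}

/* Time `rounds` passes over buf and convert to cpb for an assumed clock rate. */
static double measure_cpb(uint8_t *buf, size_t len, unsigned rounds,
                          double cpu_hz)
{
    double start = wall_seconds();
    for (unsigned r = 0; r < rounds; r++)
        dummy_kernel(buf, len);
    double elapsed = wall_seconds() - start;
    return cycles_per_byte(cpu_hz, elapsed, (double)len * rounds);
}
```

Called as measure_cpb(buf, sizeof buf, rounds, 3.8e9), this only gives a meaningful figure on an otherwise idle machine, and only if the assumed clock rate matches what the core actually ran at during the test.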
Yeah, we use simple wall clock via clock() calls. It seems to perform
most reliably on most platforms as long as high precision is not
needed.

> >>> GCC135 (Linux, ppc64le, 3.8 GHz, GCC 8.3):
> >>> * 1.94 cpb, 1815 MB/s
> >>
> >> What source code did you use for this test?
> >
> > I used Crypto++ (https://github.com/weidai11/cryptopp) for the test.
> >
> > I also spoke with Andy Polyakov. OpenSSL is observing the same issue.
>
> As already said POWER9 is "allergic" to mixing scalar and vector
> instructions. And since you will always have scalar instructions in the
> mix, most notably to calculate effective addresses, vector code is
> effectively bound to perform suboptimally. It's just the way POWER9 is,
> ...

Yeah, I've been thinking about that. How would the following perform
for loop control?

    // Run 10 iterations
    vector unsigned int x, l, s;
    x = vec_splats(0u);     // counter
    l = vec_splats(10u);    // limit
    s = vec_splats(1u);     // step

    while (vec_all_ne(x, l))
    {
        ...
        x = vec_add(x, s);
    }

The return value from vec_all_ne is an int. Will using a vector as loop
control improve performance?

> What
> one can do is to calculate as much effective addresses as possible in
> advance and group those instructions, as opposite to spreading them
> throughout loop. And of course, if you rely on compiler intrinsics, you
> are at compiler's mercy, and is not exactly in position to control
> effective address calculations (and complain ;-).

Yeah, it is a shame intrinsics are second-class citizens. Clang
calculates effective addresses using indexes (pointer math) while GCC
and XLC use offsets (integer math). That makes it harder to follow the
advice.

I thought Power9 was going to make things easier due to the vector
char and vector short loads, but they don't matter much when the
machine runs code more slowly.

> >>> All algorithms show a similar slowdown. SHA is so slow I am
> >>> considering disabling in-core crypto for SHA and going back to the
> >>> integer unit.
>
> While IBM screwed up vector-scalar mix, they did improve scalar
> performance in POWER9, significantly[!]. So that the gap between
> vector and equivalent scalar implementations gets reduced from both
> ends. I mean vector is slower, and scalar is faster, both in comparison
> to POWER8 that is. And it's more so from scalar end. Properly optimized
> (a.k.a. hand-written) vector SHA is faster than scalar, but by mere 12%,
> so that if you let compiler calculate effective addresses, it shouldn't
> come as surprise if vector turns out slower than scalar. And again, it's
> just the way POWER9 is, just accept it.
>
> >>> What is different about GCC135? Is the Power9 hardware really that slow?
> >>
> >> Generally no: https://www.ibm.com/downloads/cas/K90RQOW8
>
> One should make distinction between multi-user/capacity suite-specific
> benchmarks (like SPEC/SAP/Oracle/etc.) and cryptographic primitive
> cycles-per-byte benchmarks for single thread on idle system. In addition
> cryptographic algorithms are to certain degree special case even in
> single-thread context, because they customarily have relatively short
> dependencies between steps, which results on all kinds of special and
> non-formalize-able relations between compiler (or assembler programmer)
> and hardware :-) In other words better SPECrates are not guaranteed
> indication of better cycles-per-byte for crypto primitives.

Thanks.

_______________________________________________
cfarm-users mailing list
[email protected]
https://lists.tetaneutral.net/listinfo/cfarm-users
