>>> GCC135 is a Power9 machine. Benchmarking on the machine shows
>>> performance is off. For example, here are some numbers for AES in ECB
>>> mode:
>>>
>>> GCC112 (Linux, ppc64le, 3.7 GHz, GCC 8.2):
>>> * 1.12 cpb, 2851 MB/s
>>>
>>> GCC119 (AIX, ppc64be, 4.1 GHz, GCC 8.2):
>>> * 0.54 cpb, 7242 MB/s

Just in case: AIX misreports resource usage (getrusage), and the presented
result appears to be consistent with that misreporting. I mean, if you
calculate performance from the skewed getrusage values, you'll customarily
observe ~2x "better" cycles-per-byte results. In essence you should either
use wallclock or disregard the AIX results. But don't ask me why AIX does
it, as I don't know...

>>> GCC135 (Linux, ppc64le, 3.8 GHz, GCC 8.3):
>>> * 1.94 cpb, 1815 MB/s
>>
>> What source code did you use for this test?
>
> I used Crypto++ (https://github.com/weidai11/cryptopp) for the test.
>
> I also spoke with Andy Polyakov. OpenSSL is observing the same issue.

As already said, POWER9 is "allergic" to mixing scalar and vector
instructions. And since you will always have scalar instructions in the
mix, most notably to calculate effective addresses, vector code is
effectively bound to perform suboptimally. It's just the way POWER9 is,
and complaining about it would be like complaining about the weather. What
one can do is calculate as many effective addresses as possible in
advance and group those instructions, as opposed to spreading them
throughout the loop. And of course, if you rely on compiler intrinsics,
you are at the compiler's mercy, and are not exactly in a position to
control effective address calculations (and complain ;-).

>>> All algorithms show a similar slowdown. SHA is so slow I am
>>> considering disabling in-core crypto for SHA and going back to the
>>> integer unit.

While IBM screwed up the vector-scalar mix, they did improve scalar
performance in POWER9, significantly[!]. So the gap between vector and
equivalent scalar implementations is reduced from both ends. I mean
vector is slower and scalar is faster, both in comparison to POWER8, that
is. And it's more so from the scalar end. Properly optimized (a.k.a.
hand-written) vector SHA is faster than scalar, but by a mere 12%, so if
you let the compiler calculate effective addresses, it shouldn't come as
a surprise if vector turns out slower than scalar. And again, it's just
the way POWER9 is, just accept it.

>>> What is different about GCC135? Is the Power9 hardware really that slow?
>>
>> Generally no: https://www.ibm.com/downloads/cas/K90RQOW8

One should make a distinction between multi-user/capacity suite-specific
benchmarks (like SPEC/SAP/Oracle/etc.) and cryptographic primitive
cycles-per-byte benchmarks for a single thread on an idle system. In
addition, cryptographic algorithms are to a certain degree a special case
even in a single-thread context, because they customarily have relatively
short dependencies between steps, which results in all kinds of special
and non-formalizable relations between compiler (or assembler programmer)
and hardware :-) In other words, better SPECrates are not a guaranteed
indication of better cycles-per-byte for crypto primitives.

Cheers.
_______________________________________________
cfarm-users mailing list
https://lists.tetaneutral.net/listinfo/cfarm-users
