Hi Andy.

On Thu, Nov 28, 2013 at 09:11:35AM +0100, Andy Polyakov wrote:
> >Any comments on that?
> 
> In one word "no-o-o-o-o-o-o". :-) In more words. Preferred way to
> integrate processor-specific code is plotted in Intel AES-NI and
> SPARC T4 modules. And "preferred" does not really mean "matter of
> choice". [s390x module is usually mentioned in the context, and the
> answer is I wish I had time to do something about it.]
> 

Can you be more specific about that? What is it that you disagree with? Is
it the way I'm checking for the processor's capabilities, or the fact that
I included functions to encrypt and decrypt individual blocks? Both? Or
something else?

Regarding block encryption, my idea is to provide an optimization for it
first and then add optimizations for the most common cipher modes (CBC,
CTR and so on). That way, any cipher mode without a mode-specific
optimization can still get some level of performance improvement, as in
the sketch below.
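
Something along these lines (untested; the prototype of the assembly
routine is made up here just for illustration):

    #include <stddef.h>
    #include <string.h>

    /* Assumed prototype for the POWER8 single-block routine; the real
     * signature may differ. */
    void ppc_vcipher_AES_encrypt(const unsigned char in[16],
                                 unsigned char out[16],
                                 const void *key_schedule);

    /* Generic CBC encryption built on the single-block primitive, so a
     * mode with no dedicated assembly still benefits from vcipher. */
    static void aes_cbc_encrypt_generic(const unsigned char *in,
                                        unsigned char *out, size_t blocks,
                                        const void *key_schedule,
                                        unsigned char iv[16])
    {
        unsigned char buf[16];
        size_t i;

        while (blocks--) {
            for (i = 0; i < 16; i++)
                buf[i] = in[i] ^ iv[i];        /* CBC chaining with the IV */
            ppc_vcipher_AES_encrypt(buf, out, key_schedule);
            memcpy(iv, out, 16);               /* ciphertext becomes next IV */
            in += 16;
            out += 16;
        }
    }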

> >>This patch series adds the initial support for POWER8 new cryptographic
> >>instructions.
> >>
> >>Different versions of the ppc_vcipher_AES_[en|de]crypt were tested and
> >>no significant performance gains where found, even using multiple vector
> >>registers to load all sub-keys in advance.
> 
> You naturally won't observe difference in single-block function.

Yes, I agree. For single-block encryption the bottleneck is the vcipher
instruction latency, and almost any implementation will perform similarly.
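
Written with GCC's -mcrypto/-maltivec builtins instead of assembly, just
to make the dependency chain visible (this is only an illustration, not
the submitted code), a single AES-128 block looks like this; every vcipher
consumes the previous result, so a pre-loaded key schedule has nothing to
overlap with:

    #include <altivec.h>

    typedef __vector unsigned long long u64x2;

    /* rk[] is the expanded AES-128 key schedule: 11 round keys. */
    static u64x2 aes128_encrypt_one(u64x2 block, const u64x2 rk[11])
    {
        block ^= rk[0];                                    /* AddRoundKey */
        block = __builtin_crypto_vcipher(block, rk[1]);    /* each round  */
        block = __builtin_crypto_vcipher(block, rk[2]);    /* depends on  */
        block = __builtin_crypto_vcipher(block, rk[3]);    /* the result  */
        block = __builtin_crypto_vcipher(block, rk[4]);    /* of the one  */
        block = __builtin_crypto_vcipher(block, rk[5]);    /* before it   */
        block = __builtin_crypto_vcipher(block, rk[6]);
        block = __builtin_crypto_vcipher(block, rk[7]);
        block = __builtin_crypto_vcipher(block, rk[8]);
        block = __builtin_crypto_vcipher(block, rk[9]);
        return __builtin_crypto_vcipherlast(block, rk[10]); /* final round */
    }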

> Because all instructions are high latency and are dependent on each
> other, so there is a lot of "free slots" to execute all the
> collateral instructions. While it's not self-obvious that gain from
> pre-loading key schedule can be observed in single-threaded
> benchmark even in code with interleaved instructions in
> parallelizable modes, there might be other factors to consider. The
> POWER8 processor is SMT (right?), and it should be advantageous to
> pre-load for stream operations, so that there is more memory bus
> bandwidth available to the other threads. Or it might be more
> appropriate to use the "free slots" [which will be less numerous in
> parallelizable modes] for other things, for example maintaining
> counter values in CTR...
> 
> >>Because of that, the version
> >>included in this series was chosen based on readability.
> 
> Why not folded loop then?

I was referring specifically to changing the order in which the round keys
are loaded. I don't see any problem with using a loop instead.
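
The folded version of the same round chain would be, roughly (again with
the GCC builtins only for illustration; the keys are still consumed in
order, only the unrolling goes away):

    #include <altivec.h>

    typedef __vector unsigned long long u64x2;

    static u64x2 aes128_encrypt_one(u64x2 block, const u64x2 rk[11])
    {
        int i;

        block ^= rk[0];                        /* initial AddRoundKey     */
        for (i = 1; i < 10; i++)               /* rounds 1..9, in order   */
            block = __builtin_crypto_vcipher(block, rk[i]);
        return __builtin_crypto_vcipherlast(block, rk[10]);  /* round 10 */
    }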

> 
> >>The performance
> >>gain is about 5x in a non-final hardware.
> 
> More important question is what is theoretical asymptotic limit, how
> far are we from it and how to get there. Well, answer is naturally
> mode-specific subroutines, but it doesn't change the point. One
> should discuss even absolute numbers, not only relative improvement.

I understand your point. And yes, only mode-specific routines will be able
to extract the maximum performance from these instructions. I can post
absolute numbers, but in the end any notion of improvement can only be
obtained by comparing them with the results of the current assembly or C
implementations, especially because this is not final hardware and the
numbers will not reflect the performance of the final hardware.

> 
> >>The patch "perlasm/ppc-xlate.pl: vcipher instructions support" is not
> >>necessary for newer versions of GCC and I'd like to hear opinions if
> >>it's worth to include it or not.
> 
> Absolutely. And it applies to all new instructions. One can choose
> to implement module-specific instructions in module itself and
> common ones in ppc-xlate, e.g. vcipher in AES module and ldxvd2x in
> ppc-xlate.

Ok.

> 
> >>Feel free to ask me any questions regarding the code.
> 
> Doesn't one need to take care of vrsave? If it's not required on
> Linux, is it required elsewhere? [It was required on MacOS X].

You are right. I think Linux doesn't rely on VRSAVE right now, but it
might be better to save and set it properly. I will get more information
on that and let you know.
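
What I have in mind is roughly the following (shown in C with inline asm
only to illustrate; the patch itself would do this in the assembly
prologue/epilogue): save the caller's VRSAVE, mark the vector registers we
touch as live, and restore the old value on exit.

    #include <stdint.h>

    /* VRSAVE is SPR 256.  Setting all bits is the simple, conservative
     * way to claim every vector register for the duration of the call. */
    static inline uint32_t vrsave_claim(void)
    {
        uint32_t old, all = 0xffffffffu;

        __asm__ volatile("mfspr %0,256" : "=r"(old));   /* save caller's value */
        __asm__ volatile("mtspr 256,%0" : : "r"(all));  /* mark VRs as in use  */
        return old;
    }

    static inline void vrsave_restore(uint32_t old)
    {
        __asm__ volatile("mtspr 256,%0" : : "r"(old));  /* put it back on exit */
    }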

> 
> Is presented code endian-neutral? Manual doesn't discuss endianness
> in vcipher context, so I assume that instruction operation does not
> depend on current endianness. Which would require split endian
> operation for loading data, I assume in little-endian mode.

I think the load and store operations will need some adjustment. I can
include untested code for it, but I don't have access yet to a
little-endian POWER8 environment.
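
The kind of fixup I expect to need is a byte reversal of each block in
little-endian mode, something along these lines (untested; written with C
intrinsics just to show the intent):

    #include <altivec.h>
    #include <string.h>

    typedef __vector unsigned char u8x16;

    /* Little-endian case: load one 16-byte block and byte-reverse it so
     * the register layout matches what vcipher expects, i.e. the same
     * layout big-endian mode gets naturally.  Untested. */
    static u8x16 load_block_le(const unsigned char *in)
    {
        const u8x16 rev = { 15, 14, 13, 12, 11, 10, 9, 8,
                             7,  6,  5,  4,  3,  2, 1, 0 };
        u8x16 blk;

        memcpy(&blk, in, 16);              /* unaligned-safe load        */
        return vec_perm(blk, blk, rev);    /* undo the LE byte ordering  */
    }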

> 
> As for ld/stxvd2x for data. Manual "threatens" with penalties on
> cache line and page boundaries, and it doesn't seem to actually make
> promise that it always works with byte alignment across page
> boundaries. Yes, OS surely handles it by serving the exception, but
> we don't want it to happen. Wouldn't it be more appropriate to
> adhere to l/stvx? [See just committed vpaes-ppc.pl module for
> example.]
> 
> As for page boundaries in ld/stxvd2x. Key schedule is aligned at 64
> bits (in e_aes.c) and this doesn't preclude possibility for a
> ld/stxvd2x to cross page boundary. And if there is penalty, it might
> get costly [because of recurring nature of references to key
> schedule]. Should one consider lvx even for key schedule?
> 

They work with non-aligned data, but I will check whether there are any
issues with page boundaries.
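
For the data path, the lvx/vperm pattern you point at in vpaes-ppc.pl,
expressed here with intrinsics just to show the idea (big-endian case; I
believe little-endian would need the lvsr variant), is two aligned loads
plus a permute:

    #include <altivec.h>

    typedef __vector unsigned char u8x16;

    /* lvx-style unaligned load: two aligned loads that bracket the data
     * plus a permute derived from the pointer's low bits.  Each lvx is
     * naturally aligned, so it never crosses a cache-line or page
     * boundary by itself. */
    static u8x16 load_unaligned_block(const unsigned char *p)
    {
        u8x16 hi   = vec_ld(0, p);       /* aligned load covering the start */
        u8x16 lo   = vec_ld(15, p);      /* aligned load covering the end   */
        u8x16 perm = vec_lvsl(0, p);     /* permute control from p & 15     */

        return vec_perm(hi, lo, perm);   /* splice out the 16 wanted bytes  */
    }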

I'd like to include these changes incrementally; what do you think? That
would avoid wasting time submitting huge patches that might need to be
completely redone.

Regards,
Marcelo

