Andy Polyakov told me that:
>>> out = in;
>>>
>>> Last line is no good. in_arg is declared const and is not expected to
>>> be modified. This is of course unfortunate, but you can't take
>>> advantage of the fact that input buffer happens to be aligned.
>>
>> I see. Hmm. So in this case I'll run the realignment loop as well.
>
> Fixed below. Also note that I've chosen to loop even if output is
> aligned and input is not. For better cache utilization. Imagine user
> wanted to encrypt large chunk. If I copy unaligned input to aligned
> output completely, then by by time I encrypt data is *guaranteed* to be
> out of cache.
Indeed, makes sense.
>>> Have you considered to rearrange switch and loop in
>>> padlock_aes_cipher? I mean right now you have switch in loop, while
>>> you could have loop for every case in switch. It might make
>>> difference to performance...
>>
>>
>> Do you mean "quadruplicating" the realigner code for each case?
>
> Yes. Switches are commonly done with indirect jumps and they are more
> expensive. "Quadruplicating" the realigner results in about 700 extra
> bytes or extra 25% of binary code.
OK, but I'll move the repeating code to a macro for better readability
if you wouldn't mind.
>> Yes I can. I hope the AES_KEY structure won't be modified,
>
> We can "ensure" this by putting explicit comment in aes.h. A.
;-)
Some questions:
1) Why don't you like uint8_t/uint32_t/...? When dealing with CPU
instructions I prefer to use these "explicit" types instead of "unsigned
int".
2) Would you mind if I reformat some parts a little bit? E.g. put { }
after else? I know this way it looks cooler, but... ;-)
+ if ((realign_in_loop=out_misaligned))
+ out = realign;
+ else out = out_arg,
+ realign_in_loop = inp_misaligned;
3) Clearing the buffer...
+ /* Clean the realign buffer if it was used */
+ if (realign_in_loop && out_misaligned) {
+ volatile unsigned long *p=realign;
+ size_t n=REALIGN_SIZE/sizeof(*p);
+ while (n--) *p++=0;
+ }
Using bzero() leads to inline assembler as well. Are those few less
instructions in your code worth not using the standard function?
4) While we are at the optimization talks - in the compiled (-O2)
padlock_aes_cipher() there still remain calls to memcpy(). Everything
else is inlined. The trick to convince GCC to inline memcpy():
memcpy((long *)dst, (long *)src, len);
instead of using void* or char* pointers. Should I do it?
Anyway, thanks *a lot* for your time spent on this patch!
Michal Ludvig
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List [EMAIL PROTECTED]
Automated List Manager [EMAIL PROTECTED]