Re: Any 386 users?

Michal Ludvig Wed, 28 Jul 2004 07:05:24 -0700

Andy Polyakov told me that:
>>>             out = in;
>>>
>>> Last line is no good. in_arg is declared const and is not expected to
>>> be modified. This is of course unfortunate, but you can't take
>>> advantage of the fact that input buffer happens to be aligned.
>>
>> I see. Hmm. So in this case I'll run the realignment loop as well.
> 
> Fixed below. Also note that I've chosen to loop even if output is
> aligned and input is not. For better cache utilization. Imagine user
> wanted to encrypt large chunk. If I copy unaligned input to aligned
> output completely, then by by time I encrypt data is *guaranteed* to be
> out of cache.


Indeed, makes sense.

>>> Have you considered to rearrange switch and loop in
>>> padlock_aes_cipher? I mean right now you have switch in loop, while
>>> you could have loop for every case in switch. It might make
>>> difference to performance...
>>
>>  
>> Do you mean "quadruplicating" the realigner code for each case?
> 
> Yes. Switches are commonly done with indirect jumps and they are more
> expensive. "Quadruplicating" the realigner results in about 700 extra
> bytes or extra 25% of binary code.

OK, but I'll move the repeating code to a macro for better readability
if you wouldn't mind.

>> Yes I can. I hope the AES_KEY structure won't be modified,
> 
> We can "ensure" this by putting explicit comment in aes.h. A.

;-)

Some questions:

1) Why don't you like uint8_t/uint32_t/...? When dealing with CPU
instructions I prefer to use these "explicit" types instead of "unsigned
int".

2) Would you mind if I reformat some parts a little bit? E.g. put { }
after else? I know this way it looks cooler, but... ;-)
+       if ((realign_in_loop=out_misaligned))
+               out = realign;
+       else    out = out_arg,
+               realign_in_loop = inp_misaligned;

3) Clearing the buffer...
+       /* Clean the realign buffer if it was used */
+       if (realign_in_loop && out_misaligned) {
+               volatile unsigned long *p=realign;
+               size_t   n=REALIGN_SIZE/sizeof(*p);
+               while (n--) *p++=0;
+       }

Using bzero() leads to inline assembler as well. Are those few less
instructions in your code worth not using the standard function?

4) While we are at the optimization talks - in the compiled (-O2)
padlock_aes_cipher() there still remain calls to memcpy(). Everything
else is inlined. The trick to convince GCC to inline memcpy():
        memcpy((long *)dst, (long *)src, len);
instead of using void* or char* pointers. Should I do it?

Anyway, thanks *a lot* for your time spent on this patch!

Michal Ludvig
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [EMAIL PROTECTED]
Automated List Manager                           [EMAIL PROTECTED]

Re: Any 386 users?

Reply via email to