Hello

This is the driver for the Security System included in the Allwinner A20 SoC.
The Security System (SS for short) is a hardware cryptographic accelerator that 
supports the AES/MD5/SHA1/DES/3DES/PRNG algorithms.
It may also be found on other Allwinner SoCs.

This driver currently supports:
- MD5 and SHA1 hash algorithms
- AES block cipher in CBC mode with 128/192/256-bit keys.
- PRNG
The driver exposes all those algorithms through the kernel cryptographic API.

The driver supports both CPU-driven (also called poll mode) and DMA transfer modes.

Now let's get more technical about the driver.

=== SS CPU mode ===
The CPU-driven mode works by copying data from the source SG list to a buffer 
via sg_copy_to_buffer().
Then I "copy" this buffer to the device.
The result is first written to another buffer and then transferred to the 
destination SG list via sg_copy_from_buffer().
This approach is clearly non-optimal, but all my attempts to access the SG 
data directly have failed.

I leave in the patch my last attempts with kmap_atomic() 
(sunxi_aes_poll_kmap_atomic()) and kmap() (sunxi_aes_poll_kmap()) in case 
someone can find the issue.
Strangely, I use the same mechanism for the hash algorithms and there it works.

=== SS DMA mode ===
The DMA mode supports both directions, reading data and writing results.
The DMA mode could also be improved; I don't like the "wait for interrupt" in 
a "do while" loop.
To fix this ugliness I tried a waitqueue instead, but it decreased performance 
a lot.

Another possible improvement is to tune the DMA configuration value given to 
sunxi_dma_config().
The documentation is not clear on its usage (A20 User Manual, page 170).
How does one choose a good value for the "source/destination wait clock cycles"?
The NAND and SPI drivers use 0x07, emac uses 0x03 for both device and SDRAM.
The USB driver seems to use it more carefully, with 0x0F for the device side 
and 0 for SDRAM.
The same problem applies to the "source/destination block size" parameter.
Should the block size be:
- the maximum queue depth of the device (32 for the SS)?
- the write size (32 bits for the SS)? But in this case it is redundant with 
dma_config_t.xfer_type.src_data_width.
- the number of blocks written/read in each DMA step?
As I said, the USB driver calculates it as "(packet_size >> 2) - 1", but that 
does not help in finding a good value for the SS.
I have tried many combinations, and the current values are the most stable and 
fastest ones (for the moment).

=== PRNG ===
The PRNG function is basic, and I haven't yet taken the time to figure out how 
to test it.

=== How to use it? ===
Since the driver uses the Linux crypto API, the kernel will use the 
acceleration without any modification.
The typical in-kernel user is dm-crypt.
For userspace you will need the cryptodev or AF_ALG kernel interfaces and a 
userspace tool that is aware of them.
For example, both cryptodev and AF_ALG have an OpenSSL engine available.
But note that on the sunxi-3.4 sources I hit a bug (confirmed by Arokux) with 
AES, AF_ALG and requests larger than 163000 bytes.
So if you want to use the AES hardware acceleration from userspace (for 
OpenSSL, for example), use cryptodev.

The module's only parameter is use_dma; by default the driver does not use 
DMA.
Loading the driver with use_dma=1 makes it use DMA for all requests.
Loading the driver with use_dma=2 makes it use DMA only when that is expected 
to beat CPU mode (requests larger than 1024 bytes).
Once the driver is more stable/optimized, I will perhaps remove the use_dma 
parameter.

Whether the DMA mode uses the waitqueue can only be chosen at compile time for 
now (a #define to be commented out or not).

=== How to test/bench it? ===
I will send the source code of my bench/test tool later.
It uses cryptodev and compares the hash/cipher results with those given by 
OpenSSL.

You can also use "openssl speed" and "cryptsetup benchmark".

=== What performance gain will I get? ===
For raw AES performance, the CPU-driven mode is better for requests from 64 to 
1024 bytes.
Above 1024 bytes, the DMA mode is better.

Now let's talk about real-world performance...
I think many people want this driver for dm-crypt, so I will begin with the 
output of cryptsetup benchmark:
  Generic kernel implementation
    PBKDF2-sha1        20608 iterations per second
    aes-cbc   128b    11.8 MiB/s    12.3 MiB/s
    aes-cbc   256b     8.0 MiB/s     7.8 MiB/s
  SS in CPU driven mode
    PBKDF2-sha1        34859 iterations per second
    aes-cbc   128b    15.0 MiB/s    15.0 MiB/s
    aes-cbc   256b    15.0 MiB/s    15.0 MiB/s
  SS in DMA mode
    PBKDF2-sha1        30340 iterations per second
    aes-cbc   128b    19.0 MiB/s    19.2 MiB/s
    aes-cbc   256b    19.6 MiB/s    19.2 MiB/s
  SS in DMA mode with waitqueue
    PBKDF2-sha1        34859 iterations per second
    aes-cbc   128b    63.4 MiB/s    38.0 MiB/s
    aes-cbc   256b    38.2 MiB/s    36.3 MiB/s
Those numbers seem good, but I am very puzzled by the x2-x3 gain of the 
"DMA with waitqueue" mode, since my own benchmark shows the opposite.

As I just said, those numbers SEEM good, BUT dm-crypt only uses 512-byte 
buffers, so for the moment DMA is useless there and you will not gain more 
than 40% in performance, using the CPU-driven mode.
I tried a simple benchmark with dd, and the results are worse.
The benchmark is as follows:
- In a tmpfs I created a 500M file, which is LUKS-formatted and mounted as an 
ext2 filesystem.
- Then I "dd" a 250M file into this filesystem.
- The timings are as follows:
    - Generic kernel AES implementation: 23s
    - CPU mode: 31s
    - DMA mode: 37s
So I have the bad impression that the final performance gain is negative...
This impression was confirmed after reading the cryptsetup mailing list 
(thanks ssvb for the link).
In short, lots of hardware cryptographic engines are useless with dm-crypt 
because it uses too small a buffer (512 bytes).
You can find more information at 
http://article.gmane.org/gmane.comp.hardware.beagleboard.user/58548 and 
http://code.google.com/p/cryptsetup/issues/detail?id=150
An experimental patch is available to raise the dm-crypt buffer size, but it 
seems that you cannot enlarge the buffer beyond your hard drive's logical 
sector size.
So do not expect better performance with a classic 512-byte-sector hard 
drive.

I have also compared the output of "openssl speed" with and without the 
cryptodev engine.
Strangely, there is no major difference.

I am going mad with all these contradictory results.

For MD5/SHA1 performance, the numbers are nearly the same as for AES.

=== Clock problem ===
The SS uses two clocks, ahb_ss and ss. If someone could tell me:
- Is setting sata_pll as the parent of the SS clock correct?
- Why do I always get 0 when I query the frequency of the bus clock ahb_ss?

=== Future improvements ===
As you will have understood, there is still a lot of work to do, particularly 
on DMA.
I take the opportunity of this mail to ask who is working on the DMA engine 
(nobody is listed at http://linux-sunxi.org/Linux_mainlining_effort); if no 
one is working on it, I could try to do it, since the SS is a good client for 
it :)
The driver also needs to be cleaned of lots of debug code, and I will add 
DES/3DES support soon.

Regarding performance, it would help if someone could run their own benchmark 
to confirm or contradict my numbers.

To conclude, I want to thank all the people who answered my questions on IRC.
