Hello
This is the driver for the Security System included in the Allwinner A20 SoC.
The Security System (SS for short) is a hardware cryptographic accelerator that
supports the AES/MD5/SHA1/DES/3DES/PRNG algorithms.
It may also be present on other Allwinner SoCs.
This driver currently supports:
- MD5 and SHA1 hash algorithms
- AES block cipher in CBC mode with 128/192/256-bit keys.
- PRNG
The driver exposes all these algorithms through the kernel cryptographic API.
The driver supports both a CPU-driven mode (also called poll mode) and a DMA
transfer mode.
Now let's discuss the driver more technically.
=== SS CPU mode ===
The CPU-driven mode works by copying data from the source SG list into a buffer
via sg_copy_to_buffer().
Then I "copy" this buffer to the device.
The result is first written to another buffer and then transferred to the
destination SG list via sg_copy_from_buffer().
This is clearly non-optimal, but all my attempts to access the data directly
from the SG lists have failed.
I have left my last attempts in the patch, using kmap_atomic()
(sunxi_aes_poll_kmap_atomic()) and kmap() (sunxi_aes_poll_kmap()), in case
someone can find the issue.
Strangely, I use the same mechanism for the hash algorithms and there it works.
=== SS DMA mode ===
The DMA mode supports both directions, reading data and writing results.
The DMA mode could also be better: I don't like the "wait for interrupt" inside
a "do while" loop.
To fix this ugliness, I tried a waitqueue instead, but it decreased performance
a lot.
Another avenue for improvement is to tune the DMA configuration value given to
sunxi_dma_config().
The documentation (A20 User Manual, page 170) is not clear on its usage.
How does one determine the right value for the "source/destination wait clock
cycles"?
The NAND and SPI drivers use 0x07, and emac uses 0x03 for both the device and
SDRAM.
The USB driver seems to make better use of it, with 0x0F for the device and 0
for SDRAM.
The same problem arises with the "source/destination block size" parameter.
Should the block size be:
- the maximum queue depth of the device (32 for the SS)?
- the block write size (32 bits for the SS)? But in that case it is redundant
with dma_config_t.xfer_type.src_data_width.
- the number of blocks written/read in each DMA step?
As I said, the USB driver calculates it as "(packet_size >> 2) - 1", but that
does not help in finding a good value for the SS.
I have tried many combinations, and the current values are the most stable and
fastest ones (for the moment).
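To make the open questions concrete, here is a sketch of the kind of structure
handed to sunxi_dma_config(). Apart from xfer_type.src_data_width, which is
named above, the field and constant names are guesses at the sunxi-3.4
dma_config_t layout, and the values are just the candidates under discussion,
not verified settings.

```c
/* Hypothetical sketch only: field names other than
 * xfer_type.src_data_width are assumed, not taken from the headers. */
dma_config_t cfg = {
    .xfer_type = {
        .src_data_width = DATA_WIDTH_32BIT, /* SS FIFO is 32-bit wide      */
        .src_burst_len  = DATA_BRST_4,      /* "block size": burst length? */
    },
    /* "wait clock cycles": NAND/SPI use 0x07, emac uses 0x03,
     * USB uses 0x0F for the device and 0x00 for SDRAM. */
    .src_wait_cycles = 0x07,
    .dst_wait_cycles = 0x07,
};
/* For comparison, the USB driver derives its value as
 * (packet_size >> 2) - 1. */
```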
=== PRNG ===
The PRNG function is basic, and I have not yet taken the time to figure out how
to test it.
=== How to use it? ===
Since the driver uses the Linux crypto API, the kernel will use the
acceleration without any modification.
The typical in-kernel user is dm-crypt.
For userspace you will need the cryptodev or AF_ALG kernel interfaces and a
userspace tool aware of them.
For example, both cryptodev and AF_ALG have an OpenSSL engine available.
But note that on the sunxi-3.4 sources I hit a bug (confirmed by Arokux) with
AES, AF_ALG and requests larger than 163000 bytes.
So if you want to use the AES hardware acceleration in userspace (with OpenSSL,
for example), use cryptodev.
The only module parameter is use_dma; by default the driver does not use DMA.
Loading the driver with use_dma=1 makes the driver use DMA for all requests.
Loading the driver with use_dma=2 makes the driver use DMA only when it thinks
DMA will beat the CPU mode (requests larger than 1024 bytes).
Once the driver is more stable/optimized, I will perhaps remove the use_dma
parameter.
Whether the waitqueue is used in DMA mode can only be chosen at compile time
for now (a #define to comment out or not).
=== How to test/bench it? ===
I will send the source code of my benchmark/tester later.
It uses cryptodev and compares hash/cipher results with those given by OpenSSL.
You can also use "openssl speed" and "cryptsetup benchmark".
=== What performance gain will I get? ===
For raw AES performance, the CPU-driven mode is better for requests from 64 to
1024 bytes.
Above 1024 bytes, the DMA mode is better.
Now let's talk about real-world performance.
I think many people want this driver for dm-crypt, so I will begin with output
from cryptsetup benchmark:
(the two throughput columns are encryption speed, then decryption speed)

Generic kernel implementation
  PBKDF2-sha1    20608 iterations per second
  aes-cbc 128b   11.8 MiB/s   12.3 MiB/s
  aes-cbc 256b    8.0 MiB/s    7.8 MiB/s
SS in CPU-driven mode
  PBKDF2-sha1    34859 iterations per second
  aes-cbc 128b   15.0 MiB/s   15.0 MiB/s
  aes-cbc 256b   15.0 MiB/s   15.0 MiB/s
SS in DMA mode
  PBKDF2-sha1    30340 iterations per second
  aes-cbc 128b   19.0 MiB/s   19.2 MiB/s
  aes-cbc 256b   19.6 MiB/s   19.2 MiB/s
SS in DMA mode with waitqueue
  PBKDF2-sha1    34859 iterations per second
  aes-cbc 128b   63.4 MiB/s   38.0 MiB/s
  aes-cbc 256b   38.2 MiB/s   36.3 MiB/s
Those numbers seem good, but I am very puzzled by the 2-3x gain of the "DMA
with waitqueue" mode, since my own benchmark shows the opposite.
And as I just said, those numbers SEEM good, BUT dm-crypt only uses 512-byte
buffers, so for the moment DMA is useless there and you will not gain more than
40% over the generic implementation, in CPU-driven mode.
I have also tried a simple benchmark with dd, and the results are worse.
The benchmark is as follows:
- In a tmpfs I created a 500M file, which is LUKS-formatted and mounted as an
ext2 filesystem.
- Then I "dd" a 250M file into this filesystem.
- The timing results are as follows:
  - Generic kernel AES implementation: 23s
  - CPU mode: 31s
  - DMA mode: 37s
So I have the bad impression that the final performance gain is negative...
This impression was confirmed after reading the cryptsetup mailing list (thanks
ssvb for the link).
In short, many hardware cryptographic engines are useless with dm-crypt
because it uses too small a buffer (512 bytes).
You can find more information at
http://article.gmane.org/gmane.comp.hardware.beagleboard.user/58548 and
http://code.google.com/p/cryptsetup/issues/detail?id=150
An experimental patch is available that raises the dm-crypt buffer size, but it
seems you cannot enlarge it beyond your hard drive's logical sector size.
So do not expect better performance with a classic 512-byte-sector hard drive.
I have also compared the output of "openssl speed" with and without the
cryptodev engine.
Strangely, there is no major difference.
I am going mad with all these contradictory results.
For MD5/SHA1 performance, the numbers are nearly the same as for AES.
=== Clock problem ===
The SS uses two clocks, ahb_ss and ss. If someone could tell me:
- Is setting sata_pll as the parent of the SS clock correct?
- Why do I always get 0 when I try to read the frequency of the bus clock
ahb_ss?
=== Future improvements ===
As you will have understood, there is still a lot of work left, particularly on
DMA.
I take the opportunity of this mail to ask who is working on the DMA engine
(nobody is listed at http://linux-sunxi.org/Linux_mainlining_effort);
if no one is working on it, I could try to do it, since the SS is a good client
for it :)
The driver also needs to be cleaned of lots of debug code, and I will add
DES/3DES support soon.
Regarding performance, it would help if someone could run their own benchmarks
to confirm or contradict my numbers.
To conclude, I want to thank all the people who answered my questions on IRC.
--
You received this message because you are subscribed to the Google Groups
"linux-sunxi" group.