hi Tony,

On 04/12/20 11:12, Tony He wrote:
> Hi Jan,
>
>> what HW engine is this?  I think your best bet is to actually get the
>> engine to support GCM; with AES and SHA acceleration in place there is
>> very little to stop the HW engine from being able to support GCM...
>
> The HW engine is part of the al314 SoC. It is connected to the A15 CPU
> via PCI inside the SoC. The chip vendor will not support GCM, for all
> kinds of reasons.

ah pity, and the source code to this HW engine is closed source?

>> the numbers do suggest some form of cryptodev acceleration - can you
>> unload the cryptodev module or block access to it (e.g. chmod 000
>> /dev/crypto) ?
>
> In my second set of test numbers, I unloaded the cryptodev module. You
> can see the CCM performance is almost the same.

actually, I see the same on my i5-6800 with OpenSSL 1.0.2m but NOT with OpenSSL 1.1.1g; this leads me to believe that CCM support in the openssl 1.0.x speed command is screwed up.   It will be worthwhile to build openssl 1.1.1 for the AL314 just to see if aes-128-ccm is a viable option or not.
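
One way to see why the CCM figures look broken rather than accelerated: a cipher that really processes every byte should complete far fewer operations per run as the block size grows. A minimal Python sketch of that check, using the "Doing ... for 3s" counts from the "with cryptodev" AL314 output quoted further down (the helper name count_ratio is made up for this illustration):

```python
# Operation counts per 3 s run, copied from the "with cryptodev" output
# quoted in this thread (block size in bytes -> number of operations).
GCM_COUNTS = {16: 3509485, 64: 900678, 256: 228961, 1024: 57475, 8192: 7189}
CCM_COUNTS = {16: 10179383, 64: 10179215, 256: 10179785, 1024: 10182095, 8192: 10179225}


def count_ratio(counts):
    """Return ops(smallest block) / ops(largest block) (hypothetical helper)."""
    sizes = sorted(counts)
    return counts[sizes[0]] / counts[sizes[-1]]


gcm_ratio = count_ratio(GCM_COUNTS)
ccm_ratio = count_ratio(CCM_COUNTS)

# A cipher that touches every byte should do far fewer 8192-byte
# operations than 16-byte ones; a ratio near 1 means the work per call
# did not depend on the block size at all.
print(f"aes-128-gcm: {gcm_ratio:.1f}x more 16-byte ops than 8192-byte ops")
print(f"aes-128-ccm: {ccm_ratio:.2f}x more 16-byte ops than 8192-byte ops")
```

The GCM ratio comes out in the hundreds while the CCM ratio is essentially 1, which would be consistent with the speed loop not actually encrypting the buffer for CCM.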

JJK

Jan Just Keijser <janj...@nikhef.nl> wrote on Fri, 4 Dec 2020 at 5:49 PM:

    Hi Tony,

    On 04/12/20 08:41, Tony He wrote:
    > Hi Jan,
    > Yeah, the "-elapsed" option is needed because without it OpenSSL
    > counts user time instead of total time (user + sys time). You can
    > see:
    > * aes-128-cbc and sha1 are accelerated by the HW engine. I believe
    > the speed is faster for the openvpn dco module because it uses the
    > HW engine in kernel space and bypasses the path between openssl
    > and cryptodev.

    that is correct: the openvpn dco module sits in kernel space and
    does not need to cross the userspace<->kernelspace barrier and thus
    should have better performance

    > * aes-128-gcm is NOT accelerated by the HW engine.

    what HW engine is this?  I think your best bet is to actually get
    the engine to support GCM; with AES and SHA acceleration in place
    there is very little to stop the HW engine from being able to
    support GCM...

    > * aes-128-ccm is NOT accelerated by the HW engine, but it seems
    > to be accelerated by a HW instruction or something else. I didn't
    > know my device had such a function. The SoC type is al314.

    the numbers do suggest some form of cryptodev acceleration - can
    you unload the cryptodev module or block access to it (e.g. chmod
    000 /dev/crypto) ?

    The AL314 is a quad core Cortex A15 CPU @ 1.7 GHz ; the numbers
    *without* cryptodev look about right for that particular CPU.

    Most modern crypto packages use AES-GCM or chacha20-poly1305 as
    they are considered more secure. CBC is considered a bit outdated
    and as far as I know no openvpn release supports CCM thus far
    (which is a shame, really).

    HTH,

    JJK



    With cryptodev:
    # openssl speed -evp aes-128-cbc -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-128-cbc for 3s on 16 size blocks: 252783 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 64 size blocks: 253044 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 256 size blocks: 251746 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 1024 size blocks: 190306 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 8192 size blocks: 122657 aes-128-cbc's in 3.00s
    ......................
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-cbc          1348.18k     5398.27k    21482.33k    64957.78k   334935.38k
    # openssl speed -evp aes-128-gcm -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-128-gcm for 3s on 16 size blocks: 3509485 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 64 size blocks: 900678 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 256 size blocks: 228961 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 1024 size blocks: 57475 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 8192 size blocks: 7189 aes-128-gcm's in 3.00s
    ..................
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-gcm         18717.25k    19214.46k    19538.01k    19618.13k    19630.76k
    # openssl speed -evp aes-128-ccm -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-128-ccm for 3s on 16 size blocks: 10179383 aes-128-ccm's in 3.00s
    Doing aes-128-ccm for 3s on 64 size blocks: 10179215 aes-128-ccm's in 3.00s
    Doing aes-128-ccm for 3s on 256 size blocks: 10179785 aes-128-ccm's in 3.00s
    Doing aes-128-ccm for 3s on 1024 size blocks: 10182095 aes-128-ccm's in 3.00s
    Doing aes-128-ccm for 3s on 8192 size blocks: 10179225 aes-128-ccm's in 3.00s
    ..................
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-ccm         54290.04k   217156.59k   868674.99k  3475488.43k 27796070.40k
    # openssl speed -evp sha1 -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing sha1 for 3s on 16 size blocks: 95252 sha1's in 3.00s
    Doing sha1 for 3s on 64 size blocks: 95166 sha1's in 3.00s
    Doing sha1 for 3s on 256 size blocks: 76177 sha1's in 3.00s
    Doing sha1 for 3s on 1024 size blocks: 68799 sha1's in 3.00s
    Doing sha1 for 3s on 8192 size blocks: 53034 sha1's in 3.00s
    .................
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    sha1                  508.01k     2030.21k     6500.44k    23483.39k   144818.18k
    Without cryptodev:
    # openssl speed -evp aes-128-cbc -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-128-cbc for 3s on 16 size blocks: 9235207 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 64 size blocks: 2498066 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 256 size blocks: 645288 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 1024 size blocks: 161372 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 8192 size blocks: 20385 aes-128-cbc's in 3.00s
    ................
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-cbc         49254.44k    53292.07k    55064.58k    55081.64k    55664.64k
    # openssl speed -evp aes-128-gcm -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-128-gcm for 3s on 16 size blocks: 3507422 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 64 size blocks: 901036 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 256 size blocks: 228857 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 1024 size blocks: 57411 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 8192 size blocks: 7188 aes-128-gcm's in 3.00s
    ................
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-gcm         18706.25k    19222.10k    19529.13k    19596.29k    19628.03k
    # openssl speed -evp aes-128-ccm -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-128-ccm for 3s on 16 size blocks: 10170897 aes-128-ccm's in 3.00s
    Doing aes-128-ccm for 3s on 64 size blocks: 10167692 aes-128-ccm's in 3.00s
    Doing aes-128-ccm for 3s on 256 size blocks: 10166117 aes-128-ccm's in 3.00s
    Doing aes-128-ccm for 3s on 1024 size blocks: 10167095 aes-128-ccm's in 3.00s
    Doing aes-128-ccm for 3s on 8192 size blocks: 10172046 aes-128-ccm's in 3.00s
    .................
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-ccm         54244.78k   216910.76k   867508.65k  3470368.43k 27776466.94k
    # openssl speed -evp sha1 -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing sha1 for 3s on 16 size blocks: 1877571 sha1's in 3.00s
    Doing sha1 for 3s on 64 size blocks: 1250523 sha1's in 3.00s
    Doing sha1 for 3s on 256 size blocks: 603090 sha1's in 3.00s
    Doing sha1 for 3s on 1024 size blocks: 198963 sha1's in 3.00s
    Doing sha1 for 3s on 8192 size blocks: 27380 sha1's in 3.00s
    ...............
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    sha1                10013.71k    26677.82k    51463.68k    67912.70k    74765.65k
    Tony

    Jan Just Keijser <janj...@nikhef.nl> wrote on Wed, 2 Dec 2020
    at 11:24 PM:

        Hi Tony,

        On 02/12/20 15:51, Jan Just Keijser wrote:

        On 02/12/20 15:22, Tony He wrote:
        Hi Jan,

        Welcome to the discussion.

        >the second set of numbers doesn't make sense, and a much
        >better test is to do an actual encryption test
        I haven't compiled the cryptodev kernel module for my PC, so
        I cannot reproduce this issue for now. You don't understand
        why the performance is much worse with the cryptodev module
        for *big* blocks, right?
        If so, my guess is that the kernel may assign the work to
        multiple cores while OpenSSL uses one core. Would you share
        the output of the command "mpstat -P ALL 2"?

        sure, while using the cryptodev I see this:

        15:28:36     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
        15:28:38     all    1.87    0.00   23.19    0.12    0.00    0.00    0.00    0.00    0.00   74.81
        15:28:38       0    0.00    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
        15:28:38       1    7.00    0.00   93.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
        15:28:38       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
        15:28:38       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

        15:28:38     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
        15:28:40     all    0.75    0.00   24.19    0.00    0.00    0.00    0.00    0.00    0.00   75.06
        15:28:40       0    0.00    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
        15:28:40       1    3.50    0.00   96.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
        15:28:40       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
        15:28:40       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

        on a 4 core box; this means that 1 core is used 100% (which
        is what I expected).


        I suspect the main reason the cryptodev results on my
        i5-6800 go off the rails is due to this:
        (look at the "Doing aes-128-cbc" lines)

        $ ./openssl speed -evp aes-128-cbc
        Doing aes-128-cbc for 3s on 16 size blocks: 2835368 aes-128-cbc's in 1.14s
        Doing aes-128-cbc for 3s on 64 size blocks: 2720745 aes-128-cbc's in 1.01s
        Doing aes-128-cbc for 3s on 256 size blocks: 2377830 aes-128-cbc's in *0.74s*
        Doing aes-128-cbc for 3s on 1024 size blocks: 1538693 aes-128-cbc's in *0.40s*
        Doing aes-128-cbc for 3s on 8192 size blocks: 370202 aes-128-cbc's in *0.11s*
        OpenSSL 1.0.2m  2 Nov 2017
        built on: reproducible build, date unspecified
        options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int)
        aes(partial) idea(int) blowfish(idx)
        compiler: gcc -I. -I.. -I../include -DOPENSSL_THREADS
        -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DHAVE_CRYPTODEV
        -DUSE_CRYPTODEV_DIGESTS -Wa,--noexecstack -m64 -DL_ENDIAN
        -O3 -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT
        -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DRC4_ASM
        -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM
        -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
        -DECP_NISTZ256_ASM
        The 'numbers' are in 1000s of bytes per second processed.
        type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
        aes-128-cbc         39794.64k   172403.64k   822600.65k  3939054.08k 27569952.58k


        The timings reported for these runs are way off and probably
        just wrong. The openssl speed test is supposed to run for 3
        seconds. The result actually returned for 8192 byte blocks is

        Doing aes-128-cbc for 3s on 8192 size blocks: 370202 aes-128-cbc's in *0.11s*

        whereas without cryptodev I see

        Doing aes-128-cbc for 3s on 8192 size blocks: 457255 aes-128-cbc's in *3.00s*

        So you can see that without cryptodev the i5-6800 actually
        says it's doing more blocks (457,255 vs 370,202) but with
        cryptodev it is doing it in WAY less time.  This leads me to
        believe the openssl speed code when using cryptodev just
        "goes wrong".
        It will be very interesting to see what the encryption test
        will bring - that is a much better real-life-like example
        than a simple speed test.

        as a follow-up : someone whispered in my ear (thanks, André
        ;) ) that one should use the -elapsed option for this, so
        here are new results:

        *with* cryptodev:

        ./openssl speed -evp aes-128-cbc -elapsed
        You have chosen to measure elapsed time instead of user CPU time.
        Doing aes-128-cbc for 3s on 16 size blocks: 2825786 aes-128-cbc's in 3.00s
        Doing aes-128-cbc for 3s on 64 size blocks: 2716822 aes-128-cbc's in 3.00s
        Doing aes-128-cbc for 3s on 256 size blocks: 2369723 aes-128-cbc's in 3.00s
        Doing aes-128-cbc for 3s on 1024 size blocks: 1536054 aes-128-cbc's in 3.00s
        Doing aes-128-cbc for 3s on 8192 size blocks: 369984 aes-128-cbc's in 3.00s
        [...]
        aes-128-cbc        15,070.86k     57,958.87k    202,216.36k    524,306.43k  1,010,302.98k

        *without* cryptodev:

        $ openssl speed -evp aes-128-cbc -elapsed
        You have chosen to measure elapsed time instead of user CPU time.
        Doing aes-128-cbc for 3s on 16 size blocks: 207188725 aes-128-cbc's in 3.00s
        Doing aes-128-cbc for 3s on 64 size blocks: 56855717 aes-128-cbc's in 3.00s
        Doing aes-128-cbc for 3s on 256 size blocks: 14382122 aes-128-cbc's in 3.00s
        Doing aes-128-cbc for 3s on 1024 size blocks: 3618996 aes-128-cbc's in 3.00s
        Doing aes-128-cbc for 3s on 8192 size blocks: 456727 aes-128-cbc's in 3.00s
        [...]
        aes-128-cbc     1,105,006.53k  1,212,921.96k  1,227,274.41k  1,235,283.97k  1,247,169.19k

        which more or less reflects the encryption test results I
        posted earlier.
        The question becomes, what are your results when using the
        -elapsed flag?
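
The effect of -elapsed can be checked by hand from the "Doing ..." lines above: openssl speed reports throughput as operations times block size divided by the measured time, in thousands of bytes per second. A small Python sketch, assuming that formula and using the two 8192-byte cryptodev runs on the i5-6800 (the function name throughput_k is invented for this example):

```python
def throughput_k(ops, block_size, seconds):
    """Thousands of bytes per second, the unit openssl speed prints."""
    return ops * block_size / seconds / 1000.0


# 8192-byte aes-128-cbc lines from the i5-6800 runs with cryptodev above.
user_time = throughput_k(370202, 8192, 0.11)   # default: user CPU time only
elapsed = throughput_k(369984, 8192, 3.00)     # with -elapsed: wall-clock time

print(f"user CPU time: {user_time:,.2f}k")
print(f"elapsed time:  {elapsed:,.2f}k")
print(f"overstated by: {user_time / elapsed:.1f}x")
```

Both values reproduce the 8192-byte column of the corresponding tables, so the difference comes entirely from the divisor: with cryptodev most of the 3 seconds is spent in the kernel, and only the small user-time share is counted unless -elapsed is given.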

        JJK


        >My advice is to rerun your tests *without* the cryptodev
        >module and then decide whether you really need CBC+CCM hmacs.
        Yes, I confirm that without cryptodev the performance is
        very bad on my device. I don't have that device at hand
        right now, but I saved one aes-256-cbc result in my web
        notebook:

        type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
        aes-256-cbc         19626.95k    24289.71k    25054.46k    25347.75k    25337.86k
        Please note, there are two ways to accelerate
        encryption/decryption:
        1. HW instructions, as on Intel x86 CPUs.
        2. A crypto engine.
        When your device uses option 2 and its CPU is not powerful,
        cryptodev is normally much faster, at least for big blocks.
        For small blocks it may be slower, because pushing the work
        to the kernel and then to the HW engine takes time, and that
        time may be longer than OpenSSL needs to do the
        encryption/decryption directly.
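
The small-block versus big-block trade-off described here can be read directly off the AL314 aes-128-cbc counts posted in this thread with -elapsed. A rough Python sketch that only divides the two sets of counts (no model of the engine is assumed; the helper name speedup is hypothetical):

```python
# aes-128-cbc operations per 3 s on the AL314, copied from the -elapsed
# runs posted in this thread (block size in bytes -> operations).
WITH_CRYPTODEV = {16: 252783, 64: 253044, 256: 251746, 1024: 190306, 8192: 122657}
WITHOUT_CRYPTODEV = {16: 9235207, 64: 2498066, 256: 645288, 1024: 161372, 8192: 20385}


def speedup(block_size):
    """Ratio > 1 means the cryptodev path handled more blocks (hypothetical helper)."""
    return WITH_CRYPTODEV[block_size] / WITHOUT_CRYPTODEV[block_size]


for size in sorted(WITH_CRYPTODEV):
    verdict = "engine faster" if speedup(size) > 1 else "software faster"
    print(f"{size:5d} bytes: {speedup(size):6.2f}x  ({verdict})")

# Smallest measured block size at which offloading wins.
crossover = min(s for s in WITH_CRYPTODEV if speedup(s) > 1)
print(f"offloading starts to pay off at {crossover}-byte blocks")
```

On these numbers the engine only wins from 1024-byte blocks upward; the with-cryptodev count barely changes between 16 and 256 bytes, which suggests a fixed per-call cost dominates for small blocks.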
        Tony

        Jan Just Keijser <janj...@nikhef.nl> wrote on Wed, 2 Dec 2020
        at 7:24 PM:

            hi Tony,

            On 01/12/20 02:50, Tony He wrote:
            Hi Arne,

            openssl speed -evp aes-128-cbc
            type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
            aes-128-cbc         20035.60k   123261.54k   267081.60k  1094764.09k  9181370.18k
            openssl speed -evp aes-128-gcm
            type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
            aes-128-gcm         18738.76k    19284.91k    19524.44k    19606.87k    19685.46k
            openssl speed -evp aes-128-ccm
            type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
            aes-128-ccm         53859.07k   215581.12k   862070.02k  3460786.43k 27566347.61k
            openssl speed -evp sha1
            type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
            sha1                 3108.57k    12177.79k    57325.18k   181610.34k  1207364.27k
            openssl speed -evp chacha20-poly1305
            chacha20-poly1305 is an unknown cipher or digest
            I am using an old openssl, so chacha20-poly1305 is not supported.

            these numbers look suspiciously like you're using the
            linux cryptodev module. Openssl speed results for the
            linux cryptodev module are totally unreliable and I'd
            even go so far as to say that the *only* numbers I
            trust in the output above are for aes-128-gcm

            For example, if I do the same on an i5-6800 I get
            *without* the cryptodev module:
              $ openssl speed -evp aes-128-cbc
              aes-128-cbc     1,104,599.38k  1,208,651.07k  1,231,766.70k  1,237,545.64k  1,248,793.94k

            and with the module I get
              aes-128-cbc        45,087.41k    127,822.72k    581,517.17k  2,256,593.19k 27,583,804.51k

            the second set of numbers doesn't make sense, and a
            much better test is to do an actual encryption test, e.g.

            *without* the module
            cat BIGFILE | openssl aes-256-cbc -e -pass pass:thisisabadpassword | pv > /dev/null
            2.93GB 0:00:05 [ 549MB/s] [ <=> ]

            ('pv' aka 'pipeview' is a handy tool to measure the
            throughput of a UNIX pipe)

            and with the module:
            cat BIGFILE | ./openssl aes-256-cbc -e -pass pass:thisisabadpassword -engine cryptodev | pv > /dev/null
            engine "cryptodev" set.
            2.93GB 0:00:07 [ 426MB/s] [ <=>

            so you see that using the cryptodev module actually
            slows things down - which is to be expected, as the
            application needs to do more work using the cryptodev
            module.

            My advice is to rerun your tests *without* the
            cryptodev module and then decide whether you really need
            CBC+CCM hmacs.

            HTH,

            JJK


            Arne Schwabe <a...@rfc2549.org> wrote on Thu, 26 Nov 2020
            at 6:40 PM:

                On 26.11.20 at 10:41, Tony He wrote:
                > Hi Arne,
                >
                >> Since the original thread was not on the mailing list
                >> I am missing your goal but if your crypto accelerator
                >> already works with OpenSSL, then it will also work
                >> with the "normal" OpenVPN
                >
                > Yes, it works with "normal" OpenVPN (OpenVPN2), but
                > according to the test result, it's still not fast
                > (about 60Mbps).
                > The bottleneck is not the encryption operation any
                > more. It comes from the switching between user space
                > and kernel space in OpenVPN2, which keeps the poor CPU
                > of the embedded device very busy. That's why we need
                > OpenVPN3 running in kernel space.


                What numbers are we talking about in crypto speed?
                Could you provide the following from
                your "poor" device:


                openssl speed -evp aes-128-cbc
                openssl speed -evp aes-128-gcm
                openssl speed -evp aes-128-ccm
                openssl speed -evp sha1
                openssl speed -evp chacha20-poly1305

                I want to know what difference/gain in terms of raw
                crypto speed we are
                talking about here.



_______________________________________________
Openvpn-devel mailing list
Openvpn-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openvpn-devel
