On Mon, 23 Nov 2015 11:11:37 PM Alessandro Ghedini wrote:
> Is this TLS connections?

Yes, this is just measuring the TLS handshake.  Renegotiations predominately.
We deliberately didn't test the bulk symmetric crypto phase of the connection.


> I'd like to know more...

The data are a bit rough and ready but I've included what I can.  I wasn't 
directly involved in taking these measurements, so Chinese whispers are 
entirely possible.  I've been tasked with trying to find some performance 
enhancements.

    The TLS stack results are:

    stack         CPU %  connections/s
    OpenSSL         85      11,935
    atomic patch    22      16,465    proof of concept only, the stack is 
broken elsewhere
    NSS             47      46,507    !!!!!




    ----------------------------------------------------------------------
    The top ten bottlenecks identified by Sun Studio profiler.
    There are before my atomic change.

    Lock % 16   CPU % 41
     +-   EVP_MD_CTX_cleanup
       +-    EVP_PKEY_CTX_free
        +-    EVP_PKEY_free
         +-  CRYPTO_add_lock
          +-   locking_function
           +-  pthread_mutex_lock


    Lock % 21   CPU % 50
    +-   EVP_MD_CTX_copy_ex
      +-    EVP_PKEY_CTX_dup
       +-    CRYPTO_add_lock
        +-  locking_function
         +-     pthread_mutex_lock


    Lock % 5   CPU % -
    +- EVP_PKEY_CTX_dup
      +-   CRYPTO_add_lock
       +-     locking_function
        +-     pthread_mutex_lock


    Lock % 3   CPU % 3
      +-     EVP_PKEY_CTX_new
         +-    CRYPTO_add_lock
          +-   locking_function
           +-    pthread_mutex_lock


    Lock % -  CPU % 3
    EVP_PKEY_free


    Lock % 2   CPU % 2
    pkey_hmac_copy


    Lock % -   CPU % 1
     +-   Connection::destroy()
        +-  Connection::close()
          +-   NZIO_Close
             +-   nzos_Destroy_Ctx
                +-  SSL_free
                  +-   ssl_cert_free
                    +-   ssl_cert_clear_certs


    Lock % -   CPU % 2
    +-     ERR_clear_error
      +-  ERR_get_state
        +-    int_thread_get_item
          +-    lh_retrieve


    Locl % -   CPU % 1
    X509_check_private_key


    Lock % -   CPU % 1
    sha1_block_data_order_ssse3

    ----------------------------------------------------------------------

    After making all CRYPTO_add calls atomic, which breaks things elsewhere in
    the TLS stack, we obtained the above performance improvement and these new
    bottlenecks -- take these with more of a grain of salt:

    Lock % 6   CPU % 13
    SSL_new     
     +-  46.000   (7%)    nzos_Create_Ctx
      +-  38.530   (6%)    SSL_new
      +-  35.370   (6%)    CRYPTO_new_ex_data
     +-  35.310   (6%)    int_new_ex_data
      +-  34.360   (5%)    def_get_class
     +-  34.100   (5%)    CRYPTO_lock
    +-  34.020   (5%)    locking_function
    +-  32.620   (5%)    pthread_mutex_lock


    Lock % 6   CPU % 12
    SSL_SESSION_new
     +-  40.090   (6%)    d2i_SSL_SESSION
     +-  34.420   (6%)    SSL_SESSION_new
     +-  33.930   (5%)    CRYPTO_new_ex_data
     +-  33.880   (5%)    int_new_ex_data
     +-  32.630   (5%)    def_get_class
     +-  32.390   (5%)    CRYPTO_lock
     +-  32.370   (5%)    locking_function
     +-  30.860   (5%)    pthread_mutex_lock


    Lock % 9   CPU %22
    BIO_new     
    +  BIO_new                  1%
    +  BIO_set                  4%
    +  CRYPTO_new_ex_data       5%
    +  int_new_ex_data 
    +  def_get_class 
    +  CRYPTO_lock 
    +  locking_function 
    +  pthread_mutex_lock


    Lock % 11   CPU % 26
    BIO_free    
    +-  BIO_free
    +-  CRYPTO_free_ex_data
    +-  int_free_ex_data
    +-  def_get_class
    +-  CRYPTO_lock
    +-  locking_function
    +-  pthread_mutex_lock


    Lock % 17
    tls1_PRF     
    +- tls1_PRF
    +-  tls1_P_hash

    called from

    +-  62.400   (10%)    ssl3_get_finished
      +-  62.120   (10%)    ssl3_get_message
     +-  44.940   (7%)    ssl3_read_bytes
    +-  28.530   (5%)    ssl3_do_change_cipher_spec
    +-  23.410   (4%)    tls1_final_finish_mac

    or

      +-  16.300   (3%)    ssl3_take_mac
       +-  16.260   (3%)    tls1_final_finish_mac

    or

     +-  17.430   (3%)    ssl3_send_finished
     +-  15.240   (2%)    tls1_final_finish_mac

    or

    +-  65.480   (10%)    tls1_setup_key_block
      +-  62.890   (10%)    tls1_generate_key_block


    Lock % 6
    ERR_clear_error
    +-   ERR_clear_error
    +-   ERR_get_state
    +-  int_thread_get_item
    +-    lh_retrieve
     +-  getrn


    CPU % 18.93
    SSL_free


    CPU % 5.5
    SSL_SESSION_free



> As Matt mentioned, I'm curently working exactly on this. Would it be possible
> for you to share your testing method and code? I'd like to verify that my
> patches actually go in the right direction before having them merged (or maybe
> you could do your tests on top of my patches, they should mostly work fine 
> even
> if the work is not complete yet, I think).

I'm not sure I can share code and associated infrastructure at this point, we'd 
like to but we need approvals through the various internal channels.
It might be possible for us to run these tests against your patches and mail 
you the results (less than ideal but probably workable), I'd have to ask the 
engineer who did this to see if they can justify the time involved.


The bottom line is OpenSSL wants for finer grain locking, which the atomic 
operations provided.  Having locks in the various contexts would achieve the 
same result and be less platform/compiler specific.


For reference, my proof of concept atomic patch was:
diff --git a/include/openssl/crypto.h b/include/openssl/crypto.h
index 56afc51..803d7b7 100644
--- a/include/openssl/crypto.h
+++ b/include/openssl/crypto.h
@@ -220,8 +220,13 @@ extern "C" {
         CRYPTO_lock(CRYPTO_LOCK|CRYPTO_READ,type,__FILE__,__LINE__)
 #   define CRYPTO_r_unlock(type)   \
         CRYPTO_lock(CRYPTO_UNLOCK|CRYPTO_READ,type,__FILE__,__LINE__)
-#   define CRYPTO_add(addr,amount,type)    \
-        CRYPTO_add_lock(addr,amount,type,__FILE__,__LINE__)
+#   if defined(__GNUC__) || defined(__INTEL_COMPILER)
+#    define CRYPTO_add(addr,amount,type)    \
+         __sync_add_and_fetch(addr, amount)
+#   else
+#    define CRYPTO_add(addr,amount,type)    \
+         CRYPTO_add_lock(addr,amount,type,__FILE__,__LINE__)
+#   endif
 #  endif
 # else
 #  define CRYPTO_w_lock(a)
This should never be applied, it breaks things and is quick and ugly.


Regards,

Pauli

-- 
Oracle
Dr Paul Dale | Cryptographer | Network Security & Encryption 
Phone +61 7 3031 7217
Oracle Australia

_______________________________________________
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev

Reply via email to