[ 
https://issues.apache.org/jira/browse/KUDU-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881448#comment-15881448
 ] 

Todd Lipcon commented on KUDU-1897:
-----------------------------------

Example negotiation:

{code}
0223 14:31:57.549678 (+   940us) server_negotiation.cc:243] Received 
SASL_INITIATE NegotiatePB request
0223 14:31:57.549681 (+     3us) server_negotiation.cc:613] Received 
SASL_INITIATE request from client
0223 14:31:57.549681 (+     0us) server_negotiation.cc:624] Client requested to 
use mechanism: GSSAPI
0223 14:31:57.549684 (+     3us) server_negotiation.cc:637] Calling 
sasl_server_start()
0223 14:32:00.327515 (+2777831us) server_negotiation.cc:255] Sending 
SASL_CHALLENGE NegotiatePB response
0223 14:32:00.328662 (+  1147us) server_negotiation.cc:243] Received 
SASL_RESPONSE NegotiatePB request
0223 14:32:00.328663 (+     1us) server_negotiation.cc:692] Received 
SASL_RESPONSE request from client
0223 14:32:00.328664 (+     1us) server_negotiation.cc:702] Calling 
sasl_server_step()
0223 14:32:00.778349 (+449685us) server_negotiation.cc:255] Sending 
SASL_CHALLENGE NegotiatePB response
0223 14:32:00.778711 (+   362us) server_negotiation.cc:243] Received 
SASL_RESPONSE NegotiatePB request
0223 14:32:00.778713 (+     2us) server_negotiation.cc:692] Received 
SASL_RESPONSE request from client
...
Metrics: 
{"mutex_wait_us":3152467,"negotiator.queue_time_us":99,"tcmalloc_contention_cycles":304896,"thread_start_us":92,"threads_started":1}
{code}

The threads are all blocked in SASL:
{code}
#4  0x00000000019bfb59 in kudu::rpc::SaslMutexLock(void*) ()
#5  0x00007f94324a469a in ?? () from /usr/lib64/sasl2/libgssapiv2.so
#6  0x000000301b40f8b1 in sasl_server_step () from /usr/lib64/libsasl2.so.2
#7  0x000000301b410aa4 in sasl_server_start () from /usr/lib64/libsasl2.so.2
#8  0x00000000019c0872 in kudu::rpc::WrapSaslCall(sasl_conn*, std::function<int 
()()> const&) ()
#9  0x00000000019c5b33 in 
kudu::rpc::ServerNegotiation::HandleSaslInitiate(kudu::rpc::NegotiatePB const&) 
()
{code}

The one thread holding the lock is loading the SASL replay cache:

{code}
#0  0x000000301640ef4d in lseek64 () from /lib64/libpthread.so.0
#1  0x000000301ac7d086 in krb5_rc_io_mark () from /lib64/libkrb5.so.3
#2  0x000000301ac7c6eb in ?? () from /lib64/libkrb5.so.3
#3  0x000000301ac7cf35 in ?? () from /lib64/libkrb5.so.3
#4  0x000000301ac789f1 in krb5_get_server_rcache () from /lib64/libkrb5.so.3
#5  0x000000301b01eacb in ?? () from /lib64/libgssapi_krb5.so.2
#6  0x000000301b01ef8b in ?? () from /lib64/libgssapi_krb5.so.2
#7  0x000000301b01f4fb in ?? () from /lib64/libgssapi_krb5.so.2
#8  0x000000301b00e31b in gss_add_cred () from /lib64/libgssapi_krb5.so.2
#9  0x000000301b00e98d in gss_acquire_cred () from /lib64/libgssapi_krb5.so.2
#10 0x00007f94324a479c in ?? () from /usr/lib64/sasl2/libgssapiv2.so
#11 0x000000301b40f8b1 in sasl_server_step () from /usr/lib64/libsasl2.so.2
{code}

Looking at the SASL library, it appears that the troublesome lock is this one:
{code}
            GSS_LOCK_MUTEX(params->utils);
            maj_stat = gss_acquire_cred(&min_stat,
                                        text->server_name,
                                        GSS_C_INDEFINITE,
                                        GSS_C_NO_OID_SET,
                                        GSS_C_ACCEPT,
                                        &text->server_creds,
                                        NULL,
                                        NULL);
            GSS_UNLOCK_MUTEX(params->utils);
{code}

Newer SASL (2.1.26) offers a way to pass in the GSS_CREDS structure from the 
outside, so we could avoid this issue, but this version/feature isn't available 
on el6. So, we probably need some workaround for this.

> GSSAPI negotiation single-threaded and very slow under concurrency
> ------------------------------------------------------------------
>
>                 Key: KUDU-1897
>                 URL: https://issues.apache.org/jira/browse/KUDU-1897
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc, security
>    Affects Versions: 1.3.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> I'm running an Impala workload with ~10 concurrent clients submitting queries 
> that take a few seconds each on a 9 node cluster. The master is pegging on 
> core in negotiation, and has a bunch of other negotiation threads blocked on 
> a mutex within SASL.
> Each negotiation is taking 2-3 seconds due to waiting on this lock.
> This severely restricts connection throughput, and probably needs to be 
> addressed for 1.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to