[
https://issues.apache.org/jira/browse/KUDU-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881448#comment-15881448
]
Todd Lipcon commented on KUDU-1897:
-----------------------------------
Example negotiation:
{code}
0223 14:31:57.549678 (+ 940us) server_negotiation.cc:243] Received
SASL_INITIATE NegotiatePB request
0223 14:31:57.549681 (+ 3us) server_negotiation.cc:613] Received
SASL_INITIATE request from client
0223 14:31:57.549681 (+ 0us) server_negotiation.cc:624] Client requested to
use mechanism: GSSAPI
0223 14:31:57.549684 (+ 3us) server_negotiation.cc:637] Calling
sasl_server_start()
0223 14:32:00.327515 (+2777831us) server_negotiation.cc:255] Sending
SASL_CHALLENGE NegotiatePB response
0223 14:32:00.328662 (+ 1147us) server_negotiation.cc:243] Received
SASL_RESPONSE NegotiatePB request
0223 14:32:00.328663 (+ 1us) server_negotiation.cc:692] Received
SASL_RESPONSE request from client
0223 14:32:00.328664 (+ 1us) server_negotiation.cc:702] Calling
sasl_server_step()
0223 14:32:00.778349 (+449685us) server_negotiation.cc:255] Sending
SASL_CHALLENGE NegotiatePB response
0223 14:32:00.778711 (+ 362us) server_negotiation.cc:243] Received
SASL_RESPONSE NegotiatePB request
0223 14:32:00.778713 (+ 2us) server_negotiation.cc:692] Received
SASL_RESPONSE request from client
...
Metrics:
{"mutex_wait_us":3152467,"negotiator.queue_time_us":99,"tcmalloc_contention_cycles":304896,"thread_start_us":92,"threads_started":1}
{code}
The threads are all blocked in SASL:
{code}
#4 0x00000000019bfb59 in kudu::rpc::SaslMutexLock(void*) ()
#5 0x00007f94324a469a in ?? () from /usr/lib64/sasl2/libgssapiv2.so
#6 0x000000301b40f8b1 in sasl_server_step () from /usr/lib64/libsasl2.so.2
#7 0x000000301b410aa4 in sasl_server_start () from /usr/lib64/libsasl2.so.2
#8 0x00000000019c0872 in kudu::rpc::WrapSaslCall(sasl_conn*, std::function<int
()()> const&) ()
#9 0x00000000019c5b33 in
kudu::rpc::ServerNegotiation::HandleSaslInitiate(kudu::rpc::NegotiatePB const&)
()
{code}
The one thread holding the lock is loading the SASL replay cache:
{code}
#0 0x000000301640ef4d in lseek64 () from /lib64/libpthread.so.0
#1 0x000000301ac7d086 in krb5_rc_io_mark () from /lib64/libkrb5.so.3
#2 0x000000301ac7c6eb in ?? () from /lib64/libkrb5.so.3
#3 0x000000301ac7cf35 in ?? () from /lib64/libkrb5.so.3
#4 0x000000301ac789f1 in krb5_get_server_rcache () from /lib64/libkrb5.so.3
#5 0x000000301b01eacb in ?? () from /lib64/libgssapi_krb5.so.2
#6 0x000000301b01ef8b in ?? () from /lib64/libgssapi_krb5.so.2
#7 0x000000301b01f4fb in ?? () from /lib64/libgssapi_krb5.so.2
#8 0x000000301b00e31b in gss_add_cred () from /lib64/libgssapi_krb5.so.2
#9 0x000000301b00e98d in gss_acquire_cred () from /lib64/libgssapi_krb5.so.2
#10 0x00007f94324a479c in ?? () from /usr/lib64/sasl2/libgssapiv2.so
#11 0x000000301b40f8b1 in sasl_server_step () from /usr/lib64/libsasl2.so.2
{code}
Looking at the SASL library, it appears that the troublesome lock is this one:
{code}
GSS_LOCK_MUTEX(params->utils);
maj_stat = gss_acquire_cred(&min_stat,
text->server_name,
GSS_C_INDEFINITE,
GSS_C_NO_OID_SET,
GSS_C_ACCEPT,
&text->server_creds,
NULL,
NULL);
GSS_UNLOCK_MUTEX(params->utils);
{code}
Newer SASL (2.1.26) offers a way to pass in the GSS_CREDS structure from the
outside, so we could avoid this issue, but this version/feature isn't available
on el6. So, we probably need some workaround for this.
> GSSAPI negotiation single-threaded and very slow under concurrency
> ------------------------------------------------------------------
>
> Key: KUDU-1897
> URL: https://issues.apache.org/jira/browse/KUDU-1897
> Project: Kudu
> Issue Type: Bug
> Components: rpc, security
> Affects Versions: 1.3.0
> Reporter: Todd Lipcon
> Priority: Critical
>
> I'm running an Impala workload with ~10 concurrent clients submitting queries
> that take a few seconds each on a 9 node cluster. The master is pegging on
> core in negotiation, and has a bunch of other negotiation threads blocked on
> a mutex within SASL.
> Each negotiation is taking 2-3 seconds due to waiting on this lock.
> This severely restricts connection throughput, and probably needs to be
> addressed for 1.3.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)