[
https://issues.apache.org/jira/browse/KUDU-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249267#comment-17249267
]
ASF subversion and git services commented on KUDU-3217:
-------------------------------------------------------
Commit fcd737e6ccedd5a91367dc119e773f0b1904b4fa in kudu's branch
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=fcd737e ]
[rpc] add slow timing logging for WrapSaslCall
In the context of KUDU-3217, this patch adds logging for slow SASL calls.
The threshold is set to 250ms because that is more than enough for a local
call, and it should be long enough for a call to a robust KDC on the local
network.
Change-Id: I3f8cff71c35a4aab0307456b1dc4270520966ece
Reviewed-on: http://gerrit.cloudera.org:8080/16868
Reviewed-by: Grant Henke <[email protected]>
Reviewed-by: Bankim Bhavsar <[email protected]>
Tested-by: Kudu Jenkins
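A minimal sketch of the idea behind the patch: time the wrapped SASL callback
and log a warning when it runs longer than the 250ms threshold. The names
{{TimedSaslCall}} and {{kSlowSaslCallThreshold}} below are hypothetical, not
the actual Kudu code:
{noformat}
#include <chrono>
#include <functional>
#include <iostream>

constexpr auto kSlowSaslCallThreshold = std::chrono::milliseconds(250);

// Invokes 'call' (e.g. a lambda wrapping sasl_server_start()) and logs a
// warning if it takes longer than the threshold.
int TimedSaslCall(const char* description, const std::function<int()>& call) {
  const auto start = std::chrono::steady_clock::now();
  const int rc = call();
  const auto elapsed = std::chrono::steady_clock::now() - start;
  if (elapsed > kSlowSaslCallThreshold) {
    std::cerr << "slow SASL call '" << description << "': "
              << std::chrono::duration_cast<std::chrono::milliseconds>(
                     elapsed).count()
              << " ms" << std::endl;
  }
  return rc;
}

int main() {
  // Example usage: a fast call produces no log output.
  return TimedSaslCall("example step", [] { return 0; });
}
{noformat}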
> Limit the amount of time spent by SASL method run by WrapSaslCall
> -----------------------------------------------------------------
>
> Key: KUDU-3217
> URL: https://issues.apache.org/jira/browse/KUDU-3217
> Project: Kudu
> Issue Type: Improvement
> Components: master, security, tserver
> Reporter: Alexey Serbin
> Priority: Major
>
> There were reports of connection negotiation issues in a secure Kudu cluster:
> Kudu clients (e.g., the {{kudu cluster ksck}} tool) would fail to establish a
> connection to a tablet server or master. The issue happened rather rarely,
> but in a busy cluster it was a major nuisance because the target server
> would not accept any new connections for a very long time, and usually the
> only remedy was to restart the server ({{kudu-tserver}} or {{kudu-master}},
> respectively).
> The stack traces collected from the diagnostic files pointed to a situation
> where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read
> mode and got stuck for a long time (several minutes), while the Kerberos
> creds renewal thread was waiting to acquire the lock in write mode.
> Consequently, all other connection negotiation threads were blocked on the
> same mutex, since the mutex has {{RWMutex::Priority::PREFER_WRITING}}
> priority.
> The stacks of the related threads looked like the following:
> {noformat}
> tids=[380992]
> 0x7f060e1e45e0 <unknown>
> 0x7f060c4afa3d __GI___poll
> 0x7f0605813ffa <unknown>
> 0x7f06058147b5 <unknown>
> 0x7f0605814b07 sss_pac_make_request
> 0x7f0605813dab <unknown>
> 0x7f060df31deb <unknown>
> 0x7f060df53661 <unknown>
> 0x7f060df5407a krb5_rd_req_decoded
> 0x7f060d3d190c <unknown>
> 0x7f060d3d2dea <unknown>
> 0x7f060d3d2f49 <unknown>
> 0x7f060d3c0606 gss_accept_sec_context
> 0x7f060765c04c <unknown>
> 0x7f060d610b9b sasl_server_step
> 0x7f060d611109 sasl_server_start
> tids=[170406,168067,167927,170163,168917,169251]
> 0x7f060e1e45e0 <unknown>
> 0x7f060e1e0092 __GI___pthread_rwlock_rdlock
> 0x1d09883 kudu::rpc::WrapSaslCall()
> 0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
> 0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
> 0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
> 0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
> 0x1cfa905 kudu::internal::Invoker<>::Run()
> 0x1e9234f kudu::ThreadPool::DispatchThread()
> 0x1e88014 kudu::Thread::SuperviseThread()
> 0x7f060e1dce25 start_thread
> 0x7f060c4ba34d __clone
> tids=[380520]
> 0x7f060e1e45e0 <unknown>
> 0x7f060e1e02ac __GI___pthread_rwlock_wrlock
> 0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
> 0x1e88014 kudu::Thread::SuperviseThread()
> 0x7f060e1dce25 start_thread
> 0x7f060c4ba34d __clone
> {noformat}
> Thread {{380992}} is the thread that acquired the mutex as a reader and got
> stuck in a SASL call (the call went through the SSSD PAC plugin). Thread
> {{380520}} is the Kerberos creds renewal thread, trying to acquire the mutex
> as a writer. The rest are connection negotiation threads trying to acquire
> the lock as readers.
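> A minimal illustration of that writer-preference behavior (not Kudu's
> {{RWMutex}} implementation; just a bare sketch using the glibc pthread
> rwlock "kind" attribute, which is how writer preference is commonly
> expressed on Linux): once a writer is queued, new readers block instead
> of sharing the lock, so one stuck reader plus one waiting writer stalls
> every later reader, matching the pile-up in the stacks above.
> {noformat}
> // Illustrative sketch only; the attribute and constant are glibc-specific.
> #define _GNU_SOURCE 1
> #include <pthread.h>
>
> int main() {
>   pthread_rwlockattr_t attr;
>   pthread_rwlockattr_init(&attr);
>   // Roughly the policy a PREFER_WRITING rwlock would ask for on glibc.
>   pthread_rwlockattr_setkind_np(&attr,
>                                 PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
>
>   pthread_rwlock_t lock;
>   pthread_rwlock_init(&lock, &attr);
>
>   pthread_rwlock_rdlock(&lock);  // a reader holds the lock (the stuck SASL call)
>   // If another thread now called pthread_rwlock_wrlock(), it would queue as
>   // a writer, and any further pthread_rwlock_rdlock() calls would block
>   // behind it rather than sharing the read lock.
>   pthread_rwlock_unlock(&lock);
>
>   pthread_rwlock_destroy(&lock);
>   pthread_rwlockattr_destroy(&attr);
>   return 0;
> }
> {noformat}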
> Further investigation revealed [an issue in
> SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck
> thread looks exactly the same as the stack of {{380992}} (the latter lacked
> debug symbols, so not every frame in the stack is resolved):
> {noformat}
> #0 0x00007f29342dcdfd in poll () from /lib64/libc.so.6
> #1 0x00007f2901e722ba in sss_cli_make_request_nochecks () from
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #2 0x00007f2901e72a75 in sss_cli_check_socket () from
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #3 0x00007f2901e72e07 in sss_pac_make_request () from
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #4 0x00007f2901e71feb in sssdpac_verify () from
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #5 0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
> #6 0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
> #7 0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
> #8 0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
> #9 0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from
> /lib64/libgssapi_krb5.so.2
> #10 0x00007f292d594359 in krb5_gss_accept_sec_context () from
> /lib64/libgssapi_krb5.so.2
> #11 0x00007f292d5816d6 in gss_accept_sec_context () from
> /lib64/libgssapi_krb5.so.2
> #12 0x00007f292d7c3edc in gssapi_server_mech_step () from
> /usr/lib64/sasl2/libgssapiv2.so
> #13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
> #14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
> {noformat}
> Given that there might be many other bugs in that path and a KDC might be
> slow to respond to a particular request, it would be great to limit the
> amount of time spent in the SASL call run by {{WrapSaslCall()}}. If the call
> goes over the limit, the code would return a {{Status::TimedOut()}} or
> {{Status::ServiceUnavailable()}} status and the client side could handle the
> response appropriately, but at least Kudu masters and tablet servers would
> be able to accept new connections and handle those new requests in a timely
> manner.
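> A rough sketch of how such a limit could look; this is not the actual
> implementation, and {{Status}} and {{RunWithDeadline}} below are bare
> illustrative stand-ins. The blocking call runs on a detached helper
> thread; the caller stops waiting after the deadline, while the abandoned
> call keeps running in the background:
> {noformat}
> #include <chrono>
> #include <functional>
> #include <future>
> #include <iostream>
> #include <string>
> #include <thread>
>
> // Bare stand-in for kudu::Status, purely for illustration.
> struct Status {
>   std::string msg;
>   static Status OK() { return {"OK"}; }
>   static Status TimedOut(const std::string& m) { return {"Timed out: " + m}; }
> };
>
> // Runs 'call' on a detached thread and gives up waiting after 'limit'.
> Status RunWithDeadline(std::function<Status()> call,
>                        std::chrono::milliseconds limit) {
>   std::packaged_task<Status()> task(std::move(call));
>   std::future<Status> result = task.get_future();
>   std::thread(std::move(task)).detach();  // the call may outlive the wait
>   if (result.wait_for(limit) == std::future_status::timeout) {
>     return Status::TimedOut("SASL call exceeded the time limit");
>   }
>   return result.get();
> }
>
> int main() {
>   // Example: a call that returns promptly completes within the 250ms budget.
>   Status s = RunWithDeadline([] { return Status::OK(); },
>                              std::chrono::milliseconds(250));
>   std::cout << s.msg << std::endl;
>   return 0;
> }
> {noformat}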
> Also, it doesn't seem like a very good idea to issue a SASL call while
> holding a lock, since the SASL call often turns out to be a remote call.
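> A minimal sketch of that alternative (all names below are illustrative,
> not Kudu's): copy whatever shared state the SASL step needs while holding
> the lock, then release the lock before issuing the potentially remote call,
> so a stalled call no longer blocks the renewal thread or other readers:
> {noformat}
> #include <mutex>
> #include <shared_mutex>
> #include <string>
>
> // Hypothetical shared state guarded by a reader-writer lock.
> struct KerberosCreds { std::string principal; };
>
> std::shared_mutex g_creds_lock;
> KerberosCreds g_creds{"kudu/host@EXAMPLE.COM"};
>
> // Stub standing in for the SASL/GSSAPI step that may contact a remote KDC.
> int SlowRemoteSaslStep(const KerberosCreds& /*creds*/) { return 0; }
>
> int NegotiateWithoutHoldingLock() {
>   KerberosCreds snapshot;
>   {
>     // Hold the lock only long enough to copy the state the call needs.
>     std::shared_lock<std::shared_mutex> read_lock(g_creds_lock);
>     snapshot = g_creds;
>   }
>   // The renewal thread can now take the write lock even if this call
>   // stalls for minutes on a slow KDC or a buggy plugin.
>   return SlowRemoteSaslStep(snapshot);
> }
>
> int main() { return NegotiateWithoutHoldingLock(); }
> {noformat}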