[
https://issues.apache.org/jira/browse/KUDU-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241736#comment-17241736
]
Alexey Serbin commented on KUDU-3217:
-------------------------------------
A few more references on the fix for the related issue in SSSD:
* Bug report in RedHat bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1461462
* Corresponding published errata:
https://access.redhat.com/errata/RHEA-2018:0929
* SSSD 1.16 release notes (the fix is in 1.16 release):
https://sssd.io/docs/users/relnotes/notes_1_16_0.html
> Limit the amount of time spent by SASL method run by WrapSaslCall
> -----------------------------------------------------------------
>
> Key: KUDU-3217
> URL: https://issues.apache.org/jira/browse/KUDU-3217
> Project: Kudu
> Issue Type: Improvement
> Components: master, security, tserver
> Reporter: Alexey Serbin
> Priority: Major
>
> There were reports on connection negotiation issues in a secure Kudu cluster:
> Kudu clients (e.g., {{kudu cluster ksck}} tool) would fail to establish a
> connection to a tablet server or master. The issue happened rather rarely,
> but in a busy cluster that was a major nuisance because the target server
> (master or tablet server) would not accept any new connections for a very
> long time, and usually the solution was to restart the process.
> The stack traces collected from the diagnostic files pointed to a situation
> where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read
> mode and stuck for a long time (at least a few minutes), while the Kerberos
> creds renewal thread was waiting on the lock to be acquired in write mode.
> Consequently, all other connection negotiation threads were blocked on the
> same mutex after that since the mutex is of
> {{RWMutex::Priority::PREFER_WRITING}} priority.
> The stacks of the related threads looked like the following:
> {noformat}
> tids=[380992]
> 0x7f060e1e45e0 <unknown>
> 0x7f060c4afa3d __GI___poll
> 0x7f0605813ffa <unknown>
> 0x7f06058147b5 <unknown>
> 0x7f0605814b07 sss_pac_make_request
> 0x7f0605813dab <unknown>
> 0x7f060df31deb <unknown>
> 0x7f060df53661 <unknown>
> 0x7f060df5407a krb5_rd_req_decoded
> 0x7f060d3d190c <unknown>
> 0x7f060d3d2dea <unknown>
> 0x7f060d3d2f49 <unknown>
> 0x7f060d3c0606 gss_accept_sec_context
> 0x7f060765c04c <unknown>
> 0x7f060d610b9b sasl_server_step
> 0x7f060d611109 sasl_server_start
> tids=[170406,168067,167927,170163,168917,169251]
> 0x7f060e1e45e0 <unknown>
> 0x7f060e1e0092 __GI___pthread_rwlock_rdlock
> 0x1d09883 kudu::rpc::WrapSaslCall()
> 0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
> 0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
> 0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
> 0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
> 0x1cfa905 kudu::internal::Invoker<>::Run()
> 0x1e9234f kudu::ThreadPool::DispatchThread()
> 0x1e88014 kudu::Thread::SuperviseThread()
> 0x7f060e1dce25 start_thread
> 0x7f060c4ba34d __clone
> tids=[380520]
> 0x7f060e1e45e0 <unknown>
> 0x7f060e1e02ac __GI___pthread_rwlock_wrlock
> 0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
> 0x1e88014 kudu::Thread::SuperviseThread()
> 0x7f060e1dce25 start_thread
> 0x7f060c4ba34d __clone
> {noformat}
> Thread {{80992}} is the thread that acquired the mutex in read mode and stuck
> in a SASL call (the latter went through SSSD). Thread {{380520}} is the
> Kerberos creds renewal thread, trying to lock the mutex in write mode. The
> rest are connection negotiation threads trying to acquire the lock in read
> mode.
> Further investigation revealed an [issue in
> SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck
> thread looks exactly the same as the stack of {{80992}} (modulo the latter
> didn't have debug symbols to show information on every function in the stack):
> {noformat}
> #0 0x00007f29342dcdfd in poll () from /lib64/libc.so.6
> #1 0x00007f2901e722ba in sss_cli_make_request_nochecks () from
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #2 0x00007f2901e72a75 in sss_cli_check_socket () from
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #3 0x00007f2901e72e07 in sss_pac_make_request () from
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #4 0x00007f2901e71feb in sssdpac_verify () from
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #5 0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
> #6 0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
> #7 0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
> #8 0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
> #9 0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from
> /lib64/libgssapi_krb5.so.2
> #10 0x00007f292d594359 in krb5_gss_accept_sec_context () from
> /lib64/libgssapi_krb5.so.2
> #11 0x00007f292d5816d6 in gss_accept_sec_context () from
> /lib64/libgssapi_krb5.so.2
> #12 0x00007f292d7c3edc in gssapi_server_mech_step () from
> /usr/lib64/sasl2/libgssapiv2.so
> #13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
> #14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
> {noformat}
> With that, given that there might be many other bugs in that path and a
> remote KDC might be very slow to respond if overloaded in a larger cluster,
> it would be great to limit the amount of time spent by the SASL call run by
> {{WrapSaslCall()}}, so if it's over the limit, the code would return
> {{Status::TimedOut()}} or {{Status::ServiceUnavailable()}} status.
> Also, it doesn't seem like a very good idea to lock a mutex and issue a SASL
> call since the latter is often turns to be a call to remote KDC.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)