Hi, I have some more information on this issue. I’m tracing it down through the slapd logs and I am continuing to struggle; I was hoping that somebody could possibly help me provided this additional information.
On the domain logs (on the DC), I see an ldap_search_ext operation 155 with a timeout of 60: (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))][cn=Default Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu]. (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 155 (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_add] (0x2000): New operation 155 timeout 60 Then, on the slapd log for the server, I see the incoming request. Based on the result, it looks like the request completes successfully within 1 second: [20/Jan/2017:08:42:35.179890746 -0600] conn=1518 op=31 SRCH base="cn=Default Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu" scope=2 filter="(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))" attrs=ALL [20/Jan/2017:08:42:35.683485674 -0600] conn=1518 op=31 RESULT err=0 tag=101 nentries=0 etime=1 notes=U Then, subsequently, the domain log (on the DC), I see the same operation 155 timeout (expectedly almost a minute later). (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_timeout] (0x1000): Issuing timeout for 155 (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_destructor] (0x1000): Abandoning operation 155 (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed [110]: Connection timed out (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_get_ad_override_done] (0x0040): ipa_get_ad_override request failed. (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_id_op_destroy] (0x4000): releasing operation connection (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_initgr_get_overrides_override_done] (0x0040): IPA override lookup failed: 110 (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_id_get_groups_overrides_done] (0x0040): IPA resolve user groups overrides failed [110]. (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [be_mark_dom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu offline (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [be_mark_subdom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu as inactive (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: [110]: Connection timed out. (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [110]: Connection timed out. (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_id_op_destroy] (0x4000): releasing operation connection (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_reply_std_set] (0x0080): DP Error is OK on failed request? (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_done] (0x0400): DP Request [Account #4292]: Request handler finished [0]: Success (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [_dp_req_recv] (0x0400): DP Request [Account #4292]: Receiving request data. (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_reply_list_success] (0x0400): DP Request [Account #4292]: Finished. Success. (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_reply_std] (0x1000): DP Request [Account #4292]: Returning [Success]: 0,110,Success (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_table_value_destructor] (0x0400): Removing [0:1:0x0001:1:1::bsdad.uchicago.edu:name=user...@domain.uchicago.edu] from reply table (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_destructor] (0x0400): DP Request [Account #4292]: Request removed. (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_destructor] (0x0400): Number of active DP request: 1 It seems like whatever problem I have is a communication issue between sssd on the domain controller, and ns-slapd. I’m starting to venture down the road of doing trace/packet level debug logging on the LDAP server; this could be very time consuming and frustrating (I already have it enabled, I am starting to manipulate the log settings to yield potentially valuable data), I’m hoping that somebody might be able to provide some insight based on the information above; I can definitely lookup the user on both domain controllers & both IPA servers only use themselves for IPA servers. Thank you so much for reading and for your help. Dan > On Jan 19, 2017, at 4:15 PM, Sullivan, Daniel [CRI] > <dsulliv...@bsd.uchicago.edu> wrote: > > Hi, > > I’ve received incredibly good support from this mailing list previously; I am > hoping that somebody can help me succeed in my ongoing efforts. I have spent > a few days on this at this point and I can’t seem to figure it out how to > address this issue. On my DCs I am seeing excessive ldap_search_ext and > sdap_get_generic_ext_recv timeouts created solely by the invocation of the > ‘id’ command on sssd clients. This problem seems to present itself only when > I parallelize lookups for an ‘uncached’ user (i.e. I have never performed an > initial lookup). Individual arbitrary one-off lookups for a single uncached > user on a single system almost always work fine. This leads me to believe > this is a performance tuning issue. > > We operate in an academic research computing unit (i.e. we have an HPC > cluster), and I need the ability to lookup the same user in parallel (using > the id command) across a relatively large number of systems, for example to > spawn jobs that require large amounts of CPU cores and/or memory. Right now > I am doing about 50 parallel lookups for the same user to induce this > problem. > > Here is some background information: > > 1) I have read Jakub's “Anatomy of an SSSD Lookup” as well as “Performance > Tuning of SSSD for large IPA-AD deployments”, as well as implemented > recommendations from the performance tuning doc, including moving the sssd > cache to tmpfs. > 2) We are on ipa-server 4.4.0-14.el7_3.4 using a trusted AD domain; all of > our consumed users and groups are in the AD trusted domain. We have two > domain controllers; each is a RHEL 7.3 VM with 6 GB of memory. Almost all > (if not all) of our clients are running at least sssd 1.14, and are all RHEL > 6/7. Neither DC is swapping, and both have 2 CPUs. > 3) I have tuned SSSD clients on the DCs and all clients to include these > options (the problem persists): > a) ldap_opt_timeout = 60 > b) ldap_search_timeout = 60 > 4) On both DCs, I can clear the SSSD cache, and lookup all 2000 or so users > in my environment with 40 concurrent lookups occurring locally on each DC > (using UNIX job control). I can process all 2000 lookups in this manner > without any failures (on either DC), and have ‘pre-populated’ the SSSD cache > on both DC’s by doing this. > 6) I have made no additional performance tuning changes other than what has > been described. > > Would anybody be able to advise on any potential tuning that would be > required (presumably on the DCs), to facilitate 50 parallel lookups without > experiencing sdap_get_generic_ext_recv or ldap_search_ext timeouts? Should > I be able to do this sort of thing with relative ease? I was hoping this > would be the sort of thing that would just work, but based on my relatively > extensive testing it doesn’t. Any advice anybody could provide would be > greatly appreciated. > > Thank you, > > Dan Sullivan > > -- Manage your subscription for the Freeipa-users mailing list: https://www.redhat.com/mailman/listinfo/freeipa-users Go to http://freeipa.org for more info on the project