On Wed, Jan 25, 2017 at 10:58:34PM +0000, Sullivan, Daniel [CRI] wrote:
> Hi,
>
> My apologies for resurrecting this thread. This issue is still ongoing; at
> this point we've been looking at it for over a week and now have more than
> one staff member analyzing and trying to resolve it on a full-time basis. I
> have some more information that I was hoping a seasoned IPA expert could
> take a look at. At this point I am fairly certain it is a performance
> tuning issue in either sssd or FreeIPA on our domain controllers. It
> looks to me like the main issue is that when looking up the same user across
> a large number of nodes in parallel, all of our available 389ds threads get
> blocked in '__lll_robust_lock_wait ()' for operations involving
> ipa_extdom_common.c. This usually occurs on one of our two DCs, but
> occasionally on both. For example, out of 199 threads in the attached
> output, 179 are in the status __lll_robust_lock_wait (). All of the
> us...@xxx.uchicago.edu in this attachment are the same user.
>
> Here is more information about this issue (some of it repeated for
> convenience):
>
> 1. We currently have 2 domain controllers. Each has 6 processor cores and
> 180 threads allocated for 389ds. We have gone through Red Hat's performance
> tuning guide for directory services, made what we felt were appropriate
> changes, and made additional tuning modifications to get lower eviction
> rates and high cache hit numbers for 389ds. We have approximately 220
> connections to our domain controllers (from "cn=monitor"); depending on the
> test I've seen as many as 190 connected to a single DC.
> 2. We are using an AD domain where all of our users and groups reside.
> 3. I induce this by looking up a user (using the id command) on a large
> number of nodes (maybe 200) for a user that has never been looked up before,
> and is not cached on either the client or the DC.
> 4. Before I induce the problem, I can look up entries in LDAP without
> delay or problem (i.e. the LDAP server is performant and responsive, I can
> inspect cn=monitor or cn=config and get instantaneous results).
> 5. When I do induce the issue, the LDAP server basically becomes
> unresponsive (which is expected based on the attached output). Servicing a
> query using the ldapsearch tool (for either cn=monitor or cn=config) can take
> upwards of 1-2 minutes or longer. Eventually the LDAP server will 'recover',
> i.e. I do not typically need to restart IPA services to get this working
> again.
> 6. After a lookup fails, subsequent parallel lookups succeed and return
> the desired record (presumably from the cache).
> 7. It appears that these failures are also characterized by a
> corresponding "[monitor_hup] (0x0020): Received SIGHUP." in the sssd log.
> 8. Right before the problem occurs I see a brief spike in CPU utilization
> of the ns-slapd process, then the utilization basically drops to 0 once the
> threads are blocked in ns-slapd.
> 9. Since we are doing computation in our IPA environment, it is important
> that we can perform these types of parallel operations against our IPA
> environment at the scale we are testing.
>
> I feel like we are either DoSing the LDAP server or the sss_be / sss_nss
> processes, although I am not sure. Right now we are in the process of
> deploying an additional domain controller to see if that helps with
> distribution of load. If anybody could provide any sort of information with
> respect to addressing the issue in the attached trace I would be very
> grateful.

I think your observations are due to the fact that SSSD currently
serializes connections from a single process. Your clients will call the
extdom extended LDAP operation on the IPA server to get the information
about the user from the trusted domain. The extdom plugin runs inside of
389ds and each client connection will run in a different thread. To get the
information about the user from the trusted domain the extdom plugin calls
SSSD, and here is where the serialization happens, i.e. all threads have to
wait until the first one gets its results before the next thread can talk
to SSSD.

With an empty cache the initial lookup of a user and all its groups will
take some time, and since you used quite a number of clients all 389ds
worker threads will be "busy" waiting to talk to SSSD. As a result it would
be hard even for other requests, including the ones which do not need to
talk to SSSD, to get through, because there are no free worker threads.

To improve the situation, setting 'ignore_group_members=True' as described
on
https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/
which you already mentioned might help.

Although in general not recommended, depending on the size of the trusted
domain (i.e. the number of users and groups in the trusted domain),
enabling enumeration for SSSD on the IPA servers might help as well; see
man sssd.conf for details.

For the responsiveness of 389ds it might help to increase the number of
worker threads; check the nsslapd-threadnumber parameter in the 389ds docs,
e.g.
https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#cnconfig-nsslapd_threadnumber_Thread_Number
But with this many clients, the clients might simply use up all the threads
even with a reasonably large number of worker threads.
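
To illustrate the effect described above, here is a minimal Python sketch of
a fixed worker pool whose threads all queue on one serialized resource. It
is only a toy model, not the extdom plugin or SSSD code: the names
WORKER_THREADS, SSSD_LOOKUP_SECONDS, extdom_request, plain_request and
simulate are made up for this example, the lock is a plain threading.Lock
standing in for the serialized SSSD connection, and the numbers are
arbitrary small values chosen so the sketch runs quickly.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins, chosen small so the sketch runs quickly.
WORKER_THREADS = 4          # plays the role of nsslapd-threadnumber
SSSD_LOOKUP_SECONDS = 0.05  # plays the role of a slow, uncached lookup

# Models the single serialized channel from the 389ds process to SSSD.
sssd_lock = threading.Lock()


def extdom_request(client_id):
    """Model an extdom request: wait for the serialized channel, then look up."""
    with sssd_lock:
        time.sleep(SSSD_LOOKUP_SECONDS)
    return client_id


def plain_request(submitted_at):
    """Model a request that never touches SSSD; return how long it waited."""
    return time.monotonic() - submitted_at


def simulate(num_clients):
    """Return the delay of a plain request queued behind num_clients lookups."""
    with ThreadPoolExecutor(max_workers=WORKER_THREADS) as pool:
        lookups = [pool.submit(extdom_request, i) for i in range(num_clients)]
        plain = pool.submit(plain_request, time.monotonic())
        delay = plain.result()
        for lookup in lookups:
            lookup.result()
    return delay


if __name__ == "__main__":
    print("idle server delay:   %.3fs" % simulate(0))
    print("loaded server delay: %.3fs" % simulate(12))
```

In this model the plain request does no work at all, yet under load it waits
until enough serialized lookups have finished to free a worker. Raising
WORKER_THREADS only helps as long as the number of parallel clients stays
below it.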

HTH

bye,
Sumit

>
> Regards,
>
> Dan Sullivan
>
> --
> Manage your subscription for the Freeipa-users mailing list:
> https://www.redhat.com/mailman/listinfo/freeipa-users
> Go to http://freeipa.org for more info on the project