Re: [Freeipa-users] Needs help understand this timeout issue

Troels Hansen Wed, 08 Feb 2017 05:31:40 -0800

Cache is verified valid by looking at the cache files /var/lib/sss/db/ ldb 
files.


Also, if I lookup the user on the IPA server I get a fast response.

Looking up the user on a client which have a valid cache return the user within 
a few ms or secs.
Invalidating the cache on the client with sss_cache results in the user being 
fetched from the IPA server. This takes about 15-20 seconds.
This is acceptable.
If I do a tcpdump of ldap traffic while this happens I only see traffic between 
sssd client and ipa server.

Then I remove the cache files and restarts sssd on the client, simulating a 
newly joined client or a client that have cleaned up its ldb cache 
(ldap_purge_cache_timeout=0 only set on the server).
At this point the IPA server still have a valid cache.

Looking up the user while tcpdumping on IPA server I can see what looks like 
IPA fetches the entire information from AD.
This is ~730 groups, mostly nested groups for this user.

This takes 5-7 minutes........


The LDAP responds quickly, no CPU load, and I can do manual LDAP lookups then 
the user is fetched, so nothing that points to a busy LDAP.

It doesn't matter if I do a lookup of the user or he logs in. Its the same 
behavior.

Currently its very limited load, but this is going to scale up later on. 


----- On Feb 8, 2017, at 2:12 PM, Sullivan, Daniel [CRI] 
[email protected] wrote:

> Are you actually logging in or or just doing a lookup on a user?  I remember
> reading somewhere that groups are always re-evaluated at the point of login,
> regardless of what is in the cache.  I am not sure if this is accurate or the
> implications of whether or not it is on the client, server or both, but this
> could be what is happening.
> 
> Also, how are verifying that the group in the cache on the server is valid?  
> If
> you do a lookup a user on the server after the mapped memory cache expires
> (i.e. if you restart sssd), do you see a log entry that a valid cache entry 
> has
> been found and that entry is being returned (from memory it is in the nsss
> logs)?  It might help you to focus on the sysdb cache by setting a very low
> timeout value for the in memory mapped cache and or negative cache timeouts.
> Based on my experiencif a valid cache entry in the sysdb *NOT* found, a log
> entry is not returned (just a lookup is performed), which for an untrained eye
> makes it difficult to assess when timeouts are occurring in the sysdb cache.
> 
> Also, five minutes is a ridiculously long time to have to wait, have you
> inspected the performance of your 389 server?  Is your environment large or
> under large load?  Based on experience and another thread from this list, I
> have seen the 389ds threads get tied up due to extop operations because AD
> lookups via SSSD eventually will be blocked sssd under high load.
> 
> I had to significantly scale 389ds worker threads and processor cores to
> accommodate large numbers of concurrent AD lookups, this improved the
> performance of our environment and solved many issues.
> 
> Dan
> 
>> On Feb 8, 2017, at 1:27 AM, Troels Hansen <[email protected]> wrote:
>> 
>> No, ignore_group_members option is already set.
>> Tried setting and removing it on the client but didn't make a huge different.
>> 
>> Is a client have an completely empty cache, newly joined, empty 
>> /var/lig/sssd/db
>> and mc cache or something, the IPA server ALWAYS asks the AD for group info,
>> despite having a valid cache.
>> 
>> This leads to a delay of up to, or even more 5 minutes before a user can 
>> login.
>> Being denied access at first, but logging again later with success.
>> 
>> If the client have a cache thats only expired its able to pull the account 
>> info
>> from the cache on the IPA, without the IPA having to contact the AD.
>> 
>> 
>> ----- On Feb 6, 2017, at 2:24 PM, Sullivan, Daniel [CRI]
>> [email protected] wrote:
>> 
>>> Have you looked at the ignore_group_members option?  Maybe this is the 
>>> problem
>>> you are seeing?
>>> 
>>> https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/
>>> 
>>> ==snip==
>>> 
>>> ignore_group_members
>>> Normally the most data-intensive operation is downloading the groups 
>>> including
>>> their members. Usually, we are interested in what groups a user is a member 
>>> of
>>> (id aduser@ad_domain) as the initial step rather than what members do 
>>> specific
>>> groups include (getent group adgroup@ad_domain). Setting the
>>> ignore_group_members option to True makes all groups appear as empty, thus
>>> downloading only information about the group objects themselves and not 
>>> their
>>> members, providing a significant performance boost. Please note that id
>>> aduser@ad_domain would still return all the correct groups!
>>> ==snip==
>>> 
>>> 
>>> Dan
>>> 
>>> On Feb 6, 2017, at 1:59 AM, Troels Hansen <[email protected]> wrote:
>>> 
>>> Hi
>>> 
>>> I'm aware of the anatomy of how the lookup is done, but I would suppose a 
>>> valid
>>> cache on the IPA server would result in the cache from the IPA server being
>>> used?
>>> 
>>> I have been debugging this issue some more, and can confirm is the client 
>>> have
>>> its sssd cache invalidated by "sss_cache -E" and the IPA server have a valid
>>> cache, when the client asks for the user, the IPA server still asks the AD 
>>> for
>>> the entire group info?
>>> 
>>> I can see, that even though the cache is refreshed the attribute
>>> initgrExpireTimestamp (in the ldb cache) isn't updated.
>>> I have been unable to find out exactly what this controls?
>>> 
>>> lastUpdate and dataExpireTimestamp is updated to the time stamp of when the
>>> refresh ran.
>>> 
>>> 
>>> ----- On Feb 1, 2017, at 2:27 PM, Sullivan, Daniel [CRI]
>>> [email protected] wrote:
>>> 
>>> Have you checked to see if the user is expired in the cache, or if it is
>>> impacted by entry_cache_nowait_percentage (ref sssd.conf). The default entry
>>> timeout is only 90 minutes and entry_cache_nowait_percentage default is 50.
>>> 
>>> ldbsearch -H  /var/lib/sss/db/timestamps_xxx.xxx.uchicago.edu.ldb >
>>> ~/timestamps.txt
>>> ldbsearch -H  /var/lib/sss/db/cache_xxx.xxx.uchicago.edu.ldb > ~/cache.txt
>>> 
>>> Might be able to provide more info.
>>> 
>>> Again, I’d really familiarize yourself with the anatomy of an sssd lookup, 
>>> if
>>> you get to know the diagram and steps 1-7 here it will be a big help to your
>>> question(s).
>>> https://jhrozek.wordpress.com/2015/03/11/anatomy-of-sssd-user-lookup/
>>> 
>>> Dan
>>> 
>>> 
>>> On Feb 1, 2017, at 4:32 AM, Troels Hansen <[email protected]> wrote:
>>> 
>>> From looking af at TCP dump, I can see that if a client requests a AD user 
>>> from
>>> IPA, IPA does a full user lookup in AD, even though the IPA server have the
>>> user in local cache?
>>> 
>>> It looks like a single group generates a LOT of traffic to AD like:
>>> client -> IPA
>>> IPA -> client
>>> IPA -> AD
>>> AD -> IPA
>>> IPA -> AD
>>> IPA -> Second AD
>>> Second AD -> IPA
>>> IPA -> Second AD
>>> IPA -> AD
>>> AD -> IPA
>>> IPA -> AD
>>> IPA -> client
>>> client -> IPA
>>> IPA -> Client
>>> 
>>> Something similar continues for every group the user has.
>>> 
>>> Soo, I guess the question that I haven't been able to find documented 
>>> anywhere.
>>> Isn't the SSSD group cache on the IPA servers supposed to be used then a 
>>> sssd
>>> client requests a user?
>>> 
>>> 
>>> 
>>> ----- On Feb 1, 2017, at 9:53 AM, Troels Hansen [email protected] wrote:
>>> 
>>> Hmm, suspect its happening on the server...... thous I haven't been able to
>>> pinpoint a log entry that confirms my suspecting.
>>> 
>>> I have pinpointed the timeout to happen after 58 seconds after completely
>>> removing the SSSD cache and restaring SSSD, which leads me to think my 
>>> issue is
>>> related to ldap_enumaration_search_timeout which defaults to 60.
>>> 
>>> rm -fr /var/lib/sss/{mc,db}/* && systemctl restart sssd && time id shja
>>> 
>>> However, I'm unable to crank it up on the server as it seems to be adjusted 
>>> Down
>>> to 60 Again?
>>> 
>>> Wed Feb  1 09:40:56 2017) [sssd[be[lx.dr.dk]]] [dp_get_options] (0x0400): 
>>> Option
>>> ldap_enumeration_search_timeout has value 120
>>> (Wed Feb  1 09:40:56 2017) [sssd[be[lx.dr.dk]]] [dp_copy_options_ex] 
>>> (0x0400):
>>> Option ldap_enumeration_search_timeout has value 60
>>> (Wed Feb  1 09:40:56 2017) [sssd[be[lx.dr.dk]]] [dp_copy_options_ex] 
>>> (0x0400):
>>> Option ldap_enumeration_search_timeout has value 60
>>> 
>>> LDAP seems speedy enough, not timeouts while querying it manually while a 
>>> client
>>> is doing a user lookup.
>>> 
>>> ----- On Jan 30, 2017, at 6:06 PM, Sullivan, Daniel [CRI]
>>> [email protected] wrote:
>>> 
>>> 
>>> If the timeout is occurring on the server, I would start by increasing one 
>>> or
>>> both of these values:
>>> 
>>> ldap_opt_timeout
>>> ldap_search_timeout
>>> 
>>> If that doesn’t work I’d take look to see if the 389 server is under high 
>>> load
>>> when you are performing this operation.  The easiest way I have found to do
>>> this is to just execute an LDAP query directly against the IPA server when 
>>> you
>>> are performing an id lookup, for example:
>>> 
>>> ldapsearch -D "cn=Directory Manager" -w <pw> -s base -b "cn=config"
>>> "(objectclass=*)”
>>> 
>>> If the LDAP server is not responsive you probably need to increase the 
>>> number of
>>> worker threads for 389ds.  Also, you might want to disable referrals, check 
>>> out
>>> the man pages for this;
>>> 
>>> ldap_referrals = false
>>> 
>>> Also, FWIW, if you crank up debug logging on the sssd client, you should be 
>>> able
>>> to see the amount of seconds of timeout assigned to the operation, and 
>>> witness
>>> the fact that the operation is actually timing out by inspecting the logs
>>> themselves.  The logs can get a little verbose but the data is there.
>>> 
>>> 
>>> --
>>> Manage your subscription for the Freeipa-users mailing list:
>>> https://www.redhat.com/mailman/listinfo/freeipa-users
>>> Go to http://freeipa.org for more info on the project
>>> 
>>> --
>>> Med venlig hilsen
>>> 
>>> Troels Hansen
>>> 
>>> Systemkonsulent
>>> 
>>> Casalogic A/S
>>> 
>>> 
>>> T (+45) 70 20 10 63
>>> 
>>> M (+45) 22 43 71 57
>>> 
>>> Red Hat, SUSE, VMware, Citrix, Novell, Yellowfin BI, EnterpriseDB, Sophos og
>>> meget mere.
>>> 
>>> --
>>> Manage your subscription for the Freeipa-users mailing list:
>>> https://www.redhat.com/mailman/listinfo/freeipa-users
>>> Go to http://freeipa.org for more info on the project
>>> 
>>> --
>>> Med venlig hilsen
>>> 
>>> Troels Hansen
>>> 
>>> Systemkonsulent
>>> 
>>> Casalogic A/S
>>> 
>>> 
>>> T (+45) 70 20 10 63
>>> 
>>> M (+45) 22 43 71 57
>>> 
>>> Red Hat, SUSE, VMware, Citrix, Novell, Yellowfin BI, EnterpriseDB, Sophos og
>>> meget mere.
>> 
>> --
>> Med venlig hilsen
>> 
>> Troels Hansen
>> 
>> Systemkonsulent
>> 
>> Casalogic A/S
>> 
>> 
>> T (+45) 70 20 10 63
>> 
>> M (+45) 22 43 71 57
>> 
>> Red Hat, SUSE, VMware, Citrix, Novell, Yellowfin BI, EnterpriseDB, Sophos og
> > meget mere.

-- 
Med venlig hilsen 

Troels Hansen 

Systemkonsulent 

Casalogic A/S 


T (+45) 70 20 10 63 

M (+45) 22 43 71 57 

Red Hat, SUSE, VMware, Citrix, Novell, Yellowfin BI, EnterpriseDB, Sophos og 
meget mere.

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project

Re: [Freeipa-users] Needs help understand this timeout issue

Reply via email to