On 07/26/2012 02:53 PM, Rob Crittenden wrote:
Sigbjorn Lie wrote:
On Wed, July 25, 2012 09:54, Sigbjorn Lie wrote:
On Tue, July 24, 2012 20:29, Simo Sorce wrote:

On Tue, 2012-07-24 at 10:22 +0200, Sigbjorn Lie wrote:


I keep seeing the error message "Request is a replay" in our production environment, in various services using Kerberos such as ssh, sssd, automounter, squid, and others, after the upgrade to
RHEL 6.3 /

Jul 24 10:16:11 server027 sssd_be: GSSAPI Error: Unspecified GSS failure. Minor code may
provide more information (Request is a replay)

Searching Google seems to suggest that this is a time-related error. However, we have NTP configured (with the IPA servers as NTP servers), synchronized to external NTP servers. There has been no issue before, and I cannot find any sign of the time being out of sync on the
machines where this is happening.

This error usually appears only when the same request is found in the
replay cache. It shouldn't be related to time issues; in that case you usually get a clock-skew error.
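
To illustrate the difference between the two failures, here is a minimal sketch of how a replay cache behaves. It is not the MIT Kerberos implementation: the class name, the 300-second window, and the (client, server, timestamp) key are assumptions chosen for illustration only.

```python
import time


class ReplayCache:
    """Toy replay cache: remembers authenticators seen within a time window."""

    def __init__(self, window_seconds=300):
        # Assumed window; it doubles as the allowed clock skew in this sketch.
        self.window = window_seconds
        self.seen = {}  # (client, server, timestamp_usec) -> time first seen

    def check(self, client, server, timestamp_usec, now=None):
        now = time.time() if now is None else now
        # Forget entries that are older than the window.
        self.seen = {k: t for k, t in self.seen.items()
                     if now - t <= self.window}
        # A timestamp far from local time is a clock-skew failure,
        # which is reported separately from a replay.
        if abs(now - timestamp_usec / 1_000_000) > self.window:
            return "clock skew too great"
        key = (client, server, timestamp_usec)
        # The same authenticator presented twice inside the window is a replay.
        if key in self.seen:
            return "request is a replay"
        self.seen[key] = now
        return "ok"
```

In this model a replay is reported only when an identical authenticator arrives twice inside the window, regardless of how well the clocks agree, which is why the question of what ran immediately before the error matters.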

Can you tell me what operation was being performed by sssd when you
caught that error? Can you check whether another identical operation had been performed
immediately before?

That being said, I do have 1 IPA server (out of 3) with significantly higher CPU usage than the other 2. Its 15-minute load average sits between 0.85 and 0.95 the entire day, and the
ns-slapd (389-ds) process is running at 100% most of the time.

Load: 1.02, 0.94, 0.87

In comparison, the other two IPA servers have a 15-minute load average between 0.10 and 0.30 throughout
the day, and the ns-slapd process is far from being such a CPU hog.

On the server with the high load, even a command such as "ipactl status" can take up to 20 seconds to complete: "Directory Service: RUNNING" returns after a second or so, and listing
the rest of the services takes the remaining 19 seconds.

Also the web interface on this particular IPA server is rendered unusable, returning "Limits
exceeded for the query" for almost any action.

Restarting all the IPA services (ipactl restart) on the problematic host somewhat improves the situation; however, that particular server quickly returns to heavy load.

Using logconv.pl to analyze the dirsrv access log files shows that the server in question has the lowest search rate, at 106 queries/min. The other servers have 710 search
queries/sec and 168 queries/sec.

For modifications, all the IPA servers have about 5-6 queries/sec. For unindexed searches, the problematic server has the lowest number. It does, however, have more than twice as many GSSAPI binds as the other servers, with over 61000 GSSAPI binds over a 17-hour period.
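
A comparison like this can be roughly cross-checked with a short script that counts operations and reports them in one consistent unit (per minute). This is only a sketch, not a replacement for logconv.pl. It assumes the access log layout in which each operation line starts with a bracketed timestamp such as [24/Jul/2012:10:16:11 +0200], followed by conn=, op=, and a keyword like SRCH, MOD, or BIND, with mech=GSSAPI on GSSAPI bind lines. The function name is hypothetical, and a multi-step SASL bind may log more than one BIND line, so the bind count is an upper bound.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout: [timestamp] conn=N op=N KEYWORD rest-of-line
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] conn=\d+ op=-?\d+ (?P<op>[A-Z]+)\b(?P<rest>.*)$"
)


def summarize(lines):
    """Return {operation: (count, count_per_minute)} for an access log."""
    counts = Counter()
    first = last = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # connection open/close lines and anything unrecognized
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)
        op = m.group("op")
        if op == "BIND" and "mech=GSSAPI" in m.group("rest"):
            counts["GSSAPI_BIND"] += 1
        elif op in ("SRCH", "MOD"):
            counts[op] += 1
    # Avoid dividing by a tiny interval for very short logs.
    minutes = 1.0
    if first is not None:
        minutes = max((last - first).total_seconds() / 60.0, 1.0)
    return {op: (n, n / minutes) for op, n in counts.items()}
```

Running the same function over the access log of each server (for example with `summarize(open(path))`, where `path` is a placeholder for that server's access log) gives figures that can be compared directly, without mixing per-second and per-minute rates.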

The problematic server is a physical server with 2 x AMD 2.4GHz Quad core CPU and 8GB of RAM.

This issue is also impacting all the clients, where I see random hangs with anything involving an
LDAP or Kerberos query to the IPA servers.

Any suggestions?

Anyone ?

I am starting to see the replay error when using the "ipa" CLI tool as well, causing the request
to fail with an error.

ipa dnsrecord-show example.com hostname
ipa: ERROR: Local error: SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor
code may provide more information (Request is a replay)

Sorry, I had started a reply yesterday and got side-tracked and never sent it.

I know that feeling. :)
For the one server that is busier than the others, how are your clients configured? Are you using DNS SRV records?

We use DNS SRV records for everything LDAP that supports it: SSSD and the Linux automounter. Solaris clients, Red Hat 5 clients using nss_ldap, and NetApp use statically configured machines; however, this server is second in the server list for these machines. The primary server got more than 7x as many LDAP queries per minute, and the load on the primary is much, much lower. All Kerberos clients use DNS SRV for lookups; there is no static configuration there.

I see some hiccups on the clients as well, when browsing NFS shares (looking up UIDs), unlocking a client, etc. These seem to be related to the "faulty" IPA server with high load, as it seems to respond very slowly to a lot of LDAP queries too. I removed it from the DNS SRV records an hour ago, and things seem to run smoother. A few services are still querying it, though, and the load on the "faulty" server is still high even with fewer clients. The primary server that's now receiving most of the queries has barely increased in CPU usage at all.

For the replay, are your servers running on bare metal or in VMs? How about the clients? This sure seems like a time issue.

The time is configured as it has been for a long time. The physical IPA servers are synchronized from external time sources and provide the rest of the network with time. We have 2 physical servers and 1 virtual server. I have looked into the time, and everything does seem to be synchronized.

The number of clients has not changed much over the last few months.

These issues started appearing just after the upgrade to RHEL 6.3 / IPA 2.2.

Any suggestions on where to continue the troubleshooting?


Freeipa-users mailing list
