On 07/26/2012 02:53 PM, Rob Crittenden wrote:
Sigbjorn Lie wrote:
On Wed, July 25, 2012 09:54, Sigbjorn Lie wrote:
On Tue, July 24, 2012 20:29, Simo Sorce wrote:
On Tue, 2012-07-24 at 10:22 +0200, Sigbjorn Lie wrote:
Hi,
I keep seeing the error message "Request is a replay" in our production
environment, in various services using Kerberos such as ssh, sssd,
automounter, squid and others, after the upgrade to RHEL 6.3 / IPA 2.2.
Jul 24 10:16:11 server027 sssd_be: GSSAPI Error: Unspecified GSS
failure. Minor code may
provide more information (Request is a replay)
Searching Google seems to suggest that this is a time-related error.
However, we have NTP configured (with the IPA servers as NTP servers),
synchronized to external NTP servers. There has been no issue before,
and I cannot find any sign of the time being out of sync on the
machines where this is happening.
This error usually appears only when the same request is found in the
replay cache. It shouldn't be related to time issues; in that case
you usually get a clock skew error instead.
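
To illustrate the distinction between a replay and a clock skew error, here
is a minimal Python sketch of how a replay cache behaves in principle. It is
not the MIT Kerberos implementation. The class name, the return strings, the
principal names in the usage note and the 300 second window are all
assumptions made for illustration.

```python
import time


class ReplayCache:
    """Toy replay cache: remembers authenticators seen within the skew window.

    Illustration only. A real Kerberos library stores this on disk and
    uses its own record format.
    """

    def __init__(self, skew_seconds=300):
        # 300 seconds is assumed here as the allowed clock skew.
        self.skew_seconds = skew_seconds
        self._seen = {}  # (client, server, ctime, cusec) -> time first seen

    def check_and_store(self, client, server, ctime, cusec, now=None):
        now = time.time() if now is None else now

        # Forget entries that are older than the skew window.
        self._seen = {
            key: seen_at
            for key, seen_at in self._seen.items()
            if now - seen_at <= self.skew_seconds
        }

        # A timestamp too far from the local clock is a time problem,
        # reported separately from a replay.
        if abs(now - ctime) > self.skew_seconds:
            return "clock skew too great"

        # The exact same authenticator seen twice is a replay.
        key = (client, server, ctime, cusec)
        if key in self._seen:
            return "request is a replay"

        self._seen[key] = now
        return "ok"
```

With this sketch, presenting the same (client, server, ctime, cusec) tuple
twice within the window, for example for a hypothetical client
host/client.example.com@EXAMPLE.COM, returns "request is a replay" on the
second call, while a timestamp far from the local clock returns
"clock skew too great".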
Can you tell me what operation sssd was performing when you
caught that error? Can you check whether another identical
operation had been performed immediately before?
That being said, I do have 1 IPA server (out of 3) that has
significantly higher CPU usage than the other 2. Its 15-minute load
average sits between 0.85 and 0.95 the entire day, and the 389-ds
ns-slapd process is running at 100% CPU most of the time.
Load: 1.02, 0.94, 0.87
In comparison, the other two IPA servers have a 15-minute load average
between 0.10 and 0.30 throughout the day, and their ns-slapd process
is far from being such a CPU hog.
On the server with the high load, running even a command such as
"ipactl status" can take up to 20 seconds to complete:
"Directory Service: RUNNING" returns after a second or so, and listing
the rest of the services takes the remaining 19 seconds.
Also the web interface on this particular IPA server is rendered
unusable, returning "Limits
exceeded for the query" for almost any action.
Restarting all the IPA services (ipactl restart) on the problematic
host somewhat improves the situation, but that particular server
quickly returns to heavy load.
Using logconv.pl to analyze the dirsrv access log shows that the
server in question has the lowest search rate, with 106 queries/min.
The other servers have 710 search queries/sec and 168 queries/sec.
For modifications, all the IPA servers have about 5-6 queries/sec. For
unindexed searches, the problematic server has the lowest number. It
does, however, have more than twice as many GSSAPI binds as the other
servers, with over 61000 GSSAPI binds over a 17 hour period.
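
For cross-checking figures like these without logconv.pl, here is a rough
Python sketch that counts SRCH operations per minute and GSSAPI BIND lines
in an access log. The line format in the regular expression is an
assumption about how 389-ds access log entries look, so verify it against
your own logs. The function name summarize is hypothetical. Note that it
counts BIND lines, and a single GSSAPI authentication can produce more than
one of them.

```python
import re
from collections import Counter

# Assumed access log line shape (verify against your own log files):
# [24/Jul/2012:10:16:11 +0200] conn=12 op=1 SRCH base="dc=example,dc=com" ...
LINE_RE = re.compile(
    r'^\[(?P<minute>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\] '
    r'conn=\d+ op=-?\d+ (?P<rest>.*)$'
)


def summarize(lines):
    """Return (searches per minute, number of GSSAPI BIND lines)."""
    searches_per_minute = Counter()
    gssapi_bind_lines = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        rest = match.group("rest")
        if rest.startswith("SRCH "):
            # Bucket searches by timestamp truncated to the minute.
            searches_per_minute[match.group("minute")] += 1
        elif rest.startswith("BIND ") and "mech=GSSAPI" in rest:
            # Counts log lines, not completed authentications.
            gssapi_bind_lines += 1
    return searches_per_minute, gssapi_bind_lines
```

Running it over the same time window on each server, for example with
summarize(open(path)) where path is the access log to inspect, gives
numbers that can be compared directly between the replicas.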
The problematic server is a physical server with 2 x AMD 2.4GHz Quad
core CPU and 8GB of RAM.
This issue is also impacting all the clients, where I see random
hangs with anything involving an LDAP or Kerberos query to the IPA
servers.
Any suggestions?
Anyone?
I am starting to see the replay error when using the "ipa" CLI tool
as well, causing the request to fail with an error.
ipa dnsrecord-show example.com hostname
ipa: ERROR: Local error: SASL(-1): generic failure: GSSAPI Error:
Unspecified GSS failure. Minor
code may provide more information (Request is a replay)
Sorry, I had started a reply yesterday and got side-tracked and never
sent it.
I know that feeling. :)
For the one server that is busier than the others, how are your clients
configured? Are you using DNS SRV records?
We use DNS SRV records for every LDAP client that supports them: SSSD
and the Linux automounter. Solaris clients, Red Hat 5 using nss_ldap, and
NetApp use statically configured servers; however, the problematic server
is only the second server in the server list for these machines. The
primary server gets more than 7x more LDAP queries per minute, and the
load on the primary is much, much lower. All Kerberos clients are using
DNS SRV for lookups, no static configuration there.
I see some hiccups on the clients as well, when browsing NFS shares
(looking up UIDs), unlocking a client etc. These seem to be related to
the "faulty" IPA server with the high load, as it seems to respond
very slowly to a lot of LDAP queries too. I removed it from the DNS SRV
records an hour ago, and things seem to run smoother. A few services
are still doing lookups there though, and the load on the "faulty"
server is still high even with fewer clients. The primary server, which
is now receiving most of the queries, has barely increased in CPU usage
at all.
For the replay, are your servers running on bare metal or in VMs? How
about the clients? This sure seems like a time issue.
The time is configured as it has been for a long time. The physical IPA
servers are synchronized from external time sources, and provide the rest
of the network with time. We have 2 physical servers and 1 virtual
server. I have looked into the time, and everything does seem to be
synchronized.
The number of clients has not changed much over the last few months.
These issues started appearing just after the upgrade to RHEL 6.3 / IPA 2.2.
Any suggestions on where to continue the troubleshooting?
Regards,
Siggi