On 09/08/2012 01:34 AM, Dmitri Pal wrote:
On 07/26/2012 09:37 AM, Sigbjorn Lie wrote:
On 07/26/2012 02:53 PM, Rob Crittenden wrote:
Sigbjorn Lie wrote:
On Wed, July 25, 2012 09:54, Sigbjorn Lie wrote:
On Tue, July 24, 2012 20:29, Simo Sorce wrote:

On Tue, 2012-07-24 at 10:22 +0200, Sigbjorn Lie wrote:


Hi,



I keep seeing the error message "Request is a replay" in our production
environment, in various services using Kerberos such as ssh, sssd,
automounter, squid and others, after the upgrade to RHEL 6.3 / IPA 2.2.




Jul 24 10:16:11 server027 sssd_be: GSSAPI Error: Unspecified GSS
failure.  Minor code may
provide more information (Request is a replay)

Searching Google seems to suggest that this is an error with time.
However, we have NTP configured (IPA servers as NTP servers), which is
synchronized to external NTP servers. There has been no issue before, and
I cannot find any issue with the time being out of sync on the machines
where this is happening.
This error usually appears only when the same request is found in the
replay cache. It shouldn't be related to time issues, in that case
you usually get clock-skew.
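
To illustrate the replay-cache behaviour described above, here is a minimal Python sketch. It is not the MIT krb5 implementation. The class name, the key layout (client principal, service principal, timestamp, microseconds) and the 300-second lifetime are assumptions made for illustration only.

```python
import time
from typing import Dict, Optional, Tuple

# Hypothetical key layout: (client principal, service principal, ctime, cusec).
AuthKey = Tuple[str, str, int, int]


class ReplayCache:
    """Toy replay cache: remembers authenticators seen within a time window."""

    def __init__(self, lifetime: float = 300.0) -> None:
        # Assumed lifetime of 300 s, chosen to mirror the common
        # Kerberos clock-skew default; not taken from the thread.
        self.lifetime = lifetime
        self._seen: Dict[AuthKey, float] = {}

    def is_replay(self, key: AuthKey, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Drop entries that are older than the lifetime.
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < self.lifetime}
        if key in self._seen:
            # An identical authenticator was already presented: replay.
            return True
        self._seen[key] = now
        return False
```

In this sketch a second request carrying an identical authenticator inside the window is rejected regardless of how well the clocks agree, which is why the error points at duplicated requests rather than at clock skew.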

Can you tell me what operation was being performed by sssd when you
caught that error ? Can you check if immediately before another
identical operation had been
performed ?

That being said, I do have 1 IPA server (out of 3) that has
significantly higher CPU usage than
the other 2, the 15-minute load average is sitting at between 0.85
and 0.95 the entire day, where
the ns-slapd (389-ds) process is running at 100% most of the time.

Load: 1.02, 0.94, 0.87


In comparison, the other two IPA servers have a 15-minute average
between 0.10 - 0.30 throughout
the day, and the ns-slapd process is far from being such a CPU hog.

On the server having high load, running even a command such as
"ipactl status" can take up to 20
seconds to complete, where "Directory Service: RUNNING" returns
after a second or so, and to list
the rest of the services takes the remaining 19 seconds.

Also the web interface on this particular IPA server is rendered
unusable, returning "Limits
exceeded for the query" for almost any action.

Restarting all the IPA services (ipactl restart) on the problematic
host somewhat improves the
situation, however that particular server returns to having heavy
load quickly.

Using logconv.pl to analyze the dirsrv access log file displays
that the server in question has
the lowest search queries per min with 106 queries/min. The other
servers have 710 search
queries/sec and 168 queries/sec.

For modifications all the IPA servers have about 5-6 queries/sec.
For unindexed searches the
problematic server is the server with the lowest number. It does
however have more than twice as many
GSSAPI binds as the other servers, with over 61000
GSSAPI binds over a 17 hour period.
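
For scale, the bind count above can be turned into a rate. This is plain arithmetic on the two figures quoted (61000 binds, 17 hours). Since the count is given as "over 61000", the result is a lower bound.

```python
binds = 61000   # GSSAPI binds reported for the problematic server
hours = 17      # length of the logged period

per_minute = binds / (hours * 60)
per_second = binds / (hours * 3600)

print(f"{per_minute:.1f} binds/min, {per_second:.2f} binds/sec")
```

That works out to roughly one GSSAPI bind per second, sustained over the whole period.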


The problematic server is a physical server with 2 x AMD 2.4GHz
Quad core CPU and 8GB of RAM.


This issue is also impacting all the clients, where I see random
hangs with anything involving an
LDAP or Kerberos query to the IPA servers.

Any suggestions?


Anyone ?

I am starting to see the Replay error when using the "ipa" CLI tool
as well, causing the request
to drop out in an error.

ipa dnsrecord-show example.com hostname
ipa: ERROR: Local error: SASL(-1): generic failure: GSSAPI Error:
Unspecified GSS failure.  Minor
code may provide more information (Request is a replay)
Sorry, I had started a reply yesterday and got side-tracked and never
sent it.

I know that feeling. :)
For the one server that is busier than the others, how are your clients
configured? Are you using DNS SRV records?

We use DNS SRV records for everything LDAP that does support it ->
SSSD and Linux automounter. Solaris clients, Red Hat 5 using nss_ldap,
and NetApp use statically configured machines, however this is the
second server in the server list for these machines. The primary
server got more than 7x more LDAP queries per minute, and the load on
the primary is much, much lower. All kerberos clients are using DNS
SRV for lookups, no static configuration there.

I see some hiccups on the clients as well, when browsing NFS shares
(looking up UIDs), unlocking a client etc. It would seem like these
are related to the "faulty" IPA server with high load, as it seems to
respond very slowly to a lot of LDAP queries too. I have tried
removing it from the DNS SRV records an hour ago, and things seem to
run smoother. A few services are still looking up there though, and
the load on the "faulty" server is still high even with fewer clients.
The primary server that's now receiving most of the queries barely
increased anything at all in CPU usage.

For the replay, are your servers running in bare metal or in VMs? How
about the clients? This sure seems like a time issue.
The time is configured as it has been for a long time. The physical
IPA servers are synchronized from external time sources, providing the
rest of the network with time. We have 2 physical servers and 1
virtual server. I have looked into the time, and it does seem like
everything is synchronized.

The number of clients has not changed much over the last few months.

These issues started appearing just after the upgrade to RHEL 6.3 /
IPA 2.2.

Any suggestions to where to continue the troubleshooting?


Was this issue ever resolved?

I believe this is related to slow response from the krb server when binding with GSSAPI as documented in:

https://bugzilla.redhat.com/show_bug.cgi?id=845125

I'm waiting for an updated package to become available for RHEL 6.3. In the meantime I have switched the Linux automounters over to a simple bind to work around the issue.

Thanks for the follow up. :)


Rgds,
Siggi
