Hello,

A followup.

Tried memcached, same issue. Turns out it wasn't CAS (I think), but
something in our authentication store (Kerberos) configuration.

Some time ago we experienced problems with Kerberos incremental
replication while under load. I couldn't convince sysadmins to change
the Kerberos data back end (e.g. to LDAP), so it was reverted to
periodic full replication at five-minute intervals. I finally got shell
access to one of the KDCs and watched CAS and the KDC during moderate
load testing. Sure enough, every time the KDC received/reloaded its DB,
the KDC CPU would shoot to 100% (for about 8s) and CAS service ticket
validation errors would start to increase. This perfectly matched the
simultaneous "pause" in Eden space consumption on the CAS instances (PNG
attached if this list allows). Verified against Kerberos VMs and
physical servers of varying capacity.

The issue all but went away after switching to LDAP authN.

I'm still puzzled why an apparent login success can be followed by a
service ticket validation error (even against the same node that issued
the TGT). Now that I think about it, it could be that the testing
harness I inherited (The Grinder) is not detecting some type of authN
failure (since I understand the CAS server is not supposed to return
until it receives a response from the authN infrastructure). Time to
review, I guess.

Tom.

On May 30, 2013, at 8:54 PM, Tom Poage <[email protected]> wrote:
> Evening,
>
> Question on experiences with replication reliability.
>
> I'm doing a bit of 'burn-in' testing of a new pair of CAS servers
> (3.5.2, Ehcache, RMI replication).
>
> The test loops in a single thread over randomized loginids from a
> pool of 20k accounts: it submits a login POST to a random node of the
> pair, waits a little bit (50ms), then submits the resulting service
> ticket to the companion node. This generates about 7.5 authentication
> + service ticket validation transactions per server per second.
>
> So I get an ST validation failure on the companion node in about 0.3%
> (3 in 1000) of the cases.
>
> The service ticket cache is set to (the default) synchronous
> replication + multicast on the RHEL 6 (VMware) VMs, Oracle Java 7, no
> JVM tuning, Tomcat 6. The servers themselves are spec'd fairly small
> (1 GB, 1 CPU) when compared to our existing physical CAS production
> servers.
>
> Before I try to dive into what might be a proverbial haystack, is the
> occasional 'loss' (or delay) of a service ticket considered
> acceptable? If so, at what rate? For a worst-case scenario (i.e. a
> fast CAS client), is 50ms realistic?
>
> Thanks!
> Tom.
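For reference, the burn-in loop described in the quoted message can be sketched roughly as below. This is a minimal sketch, not our actual Grinder script: the hostnames, account pool, and the post_login/http_get helpers are illustrative assumptions, and a real CAS 3.x login POST also needs the lt/execution values scraped from the login page.

```python
# Rough sketch of the single-threaded burn-in loop from the quoted
# message. CAS_NODES, SERVICE, and ACCOUNTS are placeholders; the
# post_login/http_get HTTP helpers are assumed, not shown.
import random
import time
from urllib.parse import urlparse, parse_qs, quote

CAS_NODES = ["https://cas1.example.edu", "https://cas2.example.edu"]
SERVICE = "https://app.example.edu/"
ACCOUNTS = [("user%05d" % i, "secret") for i in range(20000)]  # 20k pool

def extract_ticket(location):
    """Pull the service ticket from the redirect back to the service."""
    return parse_qs(urlparse(location).query).get("ticket", [None])[0]

def validation_succeeded(body):
    """CAS 2.0 serviceValidate responds with XML; success is signalled
    by a <cas:authenticationSuccess> element."""
    return "<cas:authenticationSuccess" in body

def one_iteration(post_login, http_get):
    node = random.choice(CAS_NODES)                  # random node of the pair
    companion = [n for n in CAS_NODES if n != node][0]
    user, pw = random.choice(ACCOUNTS)
    location = post_login(node, user, pw, SERVICE)   # 302 Location header
    st = extract_ticket(location)
    time.sleep(0.05)                                 # the 50 ms pause
    body = http_get("%s/cas/serviceValidate?service=%s&ticket=%s"
                    % (companion, quote(SERVICE, safe=""), st))
    return validation_succeeded(body)
```

At ~7.5 transactions per server per second, each iteration exercises one login on one node and one ST validation on its companion, which is where the 0.3% failures showed up.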



<<attachment: VM.png>>