Hello. A follow-up:
Tried memcached; same issue. It turns out it (probably) wasn't CAS, but something in our authentication store (Kerberos) configuration.

Some time ago we experienced problems with Kerberos incremental replication while under load. I couldn't convince the sysadmins to change the Kerberos data back end (e.g. to LDAP), so it was reverted to periodic full replication at five-minute intervals.

I finally got shell access to one of the KDCs and watched CAS and the KDC during moderate load testing. Sure enough, every time the KDC received/reloaded its database, the KDC's CPU would shoot to 100% for about 8 seconds, and CAS service ticket validation errors would start to increase. This matched exactly the simultaneous "pause" in Eden space consumption on the CAS instances (PNG attached, if this list allows attachments). I verified this against Kerberos VMs and physical servers of varying capacity. The issue all but went away after switching to LDAP authN.

I'm still puzzled why we see an apparent login success but a service ticket validation error, even against the same system that issued the TGT. Now that I think about it, the testing harness I inherited (The Grinder) may not be picking up some type of authN failure, since I understand the CAS server is not supposed to return until it receives a response from the authN infrastructure. Time to review, I guess.

Tom.

On May 30, 2013, at 8:54 PM, Tom Poage <[email protected]> wrote:

> Evening,
>
> Question on experiences with replication reliability.
>
> I'm doing a bit of 'burn-in' testing of a new pair of CAS servers
> (3.5.2, Ehcache, RMI replication).
>
> The testing loops in a single thread on randomized loginids from a
> pool of 20k accounts, submitting a login POST to a random node of the
> pair, waits a little bit (50ms), then submits the resulting service
> ticket to its companion node. This generates about 7.5 authentication
> + service ticket validation transactions per server per second.
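[For anyone reproducing this: the burn-in loop described above looks roughly like the following, written as plain Python rather than an actual Grinder script. The node URLs, account names, and password placeholder are all hypothetical, and the HTTP calls are injected so the sketch stays transport-agnostic.]

```python
import random
import time

# Hypothetical node URLs standing in for the real pair.
NODES = ["https://cas1.example.edu", "https://cas2.example.edu"]

def run_iteration(http_post, http_get, accounts, wait_s=0.05):
    """One login + cross-node validation round trip.

    http_post/http_get are injected callables; in the real harness they
    would be Grinder HTTPRequest calls against the actual servers.
    """
    user = random.choice(accounts)
    login_node = random.choice(NODES)
    companion = NODES[1 - NODES.index(login_node)]

    # POST credentials to the login endpoint on a random node; the harness
    # then pulls the service ticket (ST) out of the response.
    st = http_post(login_node + "/cas/login",
                   {"username": user, "password": "secret"})  # placeholder

    time.sleep(wait_s)  # the 50 ms pause from the test description

    # Validate the ST on the *companion* node; a failure here is the
    # ~0.3% case from the original message.
    body = http_get(companion + "/cas/serviceValidate", {"ticket": st})
    return companion, "authenticationSuccess" in body
```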
> So I get an ST validation failure on the companion node in about 0.3%
> (3 in 1000) of the cases.
>
> The service ticket cache is set to (the default) synchronous
> replication + multicast on the RHEL 6 (VMware) VMs, Oracle Java 7, no
> JVM tuning, Tomcat 6. The servers themselves are spec'd fairly small
> (1 GB, 1 CPU) when compared to our existing physical CAS production
> servers.
>
> Before I try to dive into what might be a proverbial haystack, is the
> occasional 'loss' (or delay) of a service ticket considered
> acceptable? If so, at what rate? For a worst-case scenario (i.e. a
> fast CAS client), is 50ms realistic?
>
> Thanks!
> Tom.
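[P.S. The KDC-CPU/validation-error correlation I eyeballed above can be expressed as a small sketch, in case anyone wants to automate the check against their own logs. The 90% threshold and the two-second slack are illustrative assumptions, not measurements.]

```python
def spike_windows(samples, threshold=90.0):
    """Find CPU spike windows in timestamped samples.

    samples: list of (epoch_seconds, cpu_percent), in time order.
    Returns a list of (start, end) windows where CPU stayed >= threshold.
    """
    windows, start = [], None
    for t, cpu in samples:
        if cpu >= threshold and start is None:
            start = t                      # spike begins
        elif cpu < threshold and start is not None:
            windows.append((start, t))     # spike ends
            start = None
    if start is not None:                  # spike still open at end of data
        windows.append((start, samples[-1][0]))
    return windows

def errors_in_spikes(error_times, windows, slack=2.0):
    """Count errors landing within `slack` seconds of any spike window."""
    return sum(
        1 for e in error_times
        if any(s - slack <= e <= t + slack for s, t in windows)
    )
```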
<<attachment: VM.png>>
