Bryan,

I can't see your architecture diagram because I don't have a Lucidchart account. You are correct in your assessment of why an ST could fail to validate:

(1) invalid ticket
(2) reused ticket (a special case of #1)
(3) timed-out ticket (a special case of #1)

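As a rough illustration of those three cases, here is a self-contained sketch of a validator that reports which one applied. It is not the CAS source: the class, the enum, the in-memory map, and the timeout parameter are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; every name here is hypothetical, not CAS API.
class StValidationSketch {

    // One value per failure reason, so a log line can state the exact cause.
    enum Outcome { VALID, NOT_FOUND, ALREADY_USED, TIMED_OUT }

    // Minimal stand-in for a service ticket record.
    static final class Ticket {
        final long createdAtMillis;
        boolean used;

        Ticket(long createdAtMillis) {
            this.createdAtMillis = createdAtMillis;
        }
    }

    // Stand-in for the ticket registry of a single node.
    private final Map<String, Ticket> registry = new HashMap<>();
    private final long stTimeoutMillis;

    StValidationSketch(long stTimeoutMillis) {
        this.stTimeoutMillis = stTimeoutMillis;
    }

    void issue(String id, long nowMillis) {
        registry.put(id, new Ticket(nowMillis));
    }

    Outcome validate(String id, long nowMillis) {
        Ticket ticket = registry.get(id);
        if (ticket == null) {
            // Case 1: the ticket is unknown to this node.
            return Outcome.NOT_FOUND;
        }
        if (ticket.used) {
            // Case 2: the ticket was already validated once.
            return Outcome.ALREADY_USED;
        }
        if (nowMillis - ticket.createdAtMillis > stTimeoutMillis) {
            // Case 3: the ticket is older than the configured timeout.
            return Outcome.TIMED_OUT;
        }
        ticket.used = true; // single use: a second attempt hits case 2
        return Outcome.VALID;
    }
}
```

With a 10-second timeout, for example, a ticket issued at time 0 and validated at 5000 ms comes back VALID, and a second attempt on the same ticket comes back ALREADY_USED.
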
Additionally, I think that if the TGT were expired prior to validating the ST, the ST would also become invalid.

Since you mention you are using Hazelcast, I will assume you have a multi-CAS architecture and are propagating tickets between your nodes. In the cases where your ST validations failed, are there any instances where the ST failed to validate at the *same* CAS node where the ST was issued? If so, then it would seem to be a timeout issue (you could try increasing the ST timeout to see if that helps). If not, it may be that your STs are being validated before Hazelcast has had a chance to propagate the ticket to the peer CAS node.

I am working with a multi-node CAS in pre-production using Hazelcast, and the CAS logs show something like this when Hazelcast receives a ticket from a peer:

2015-07-17 16:53:53,611 DEBUG [net.unicon.cas.addons.ticket.registry.HazelcastTicketRegistry] - Returning Ticket[ST-260-cyVBxGmIHyDfrff0nplR-cas1.dev.lafayette.edu] from the Hazelcast IMap

If that entry came *after* the failed ST validation attempt, that would explain why the ST validation failed.

Thanks,
Carl

----- Original Message -----
From: "Bryan Wooten" <bryan.woo...@utah.edu>
To: cas-dev@lists.jasig.org
Sent: Monday, July 20, 2015 6:10:59 PM
Subject: [cas-dev] Diagnosing Service Ticket validation errors

Sorry if I am spamming this list but I am desperate.

We are getting random ST validation errors from our CAS clients, both internal and SaaS applications. Results in 500 errors to the end user. On July 14th we got over 2000 of these errors out of about 30k successful logins. This led to (thanks ITIL) awareness up to the VP level. I am under the gun to find a “solution” before the start of school August 24th.

I have turned up log level to debug on the CAS servers. I see successful validations in the logs, but not unsuccessful validations. I also see ST creation in the audit log.

Now if I understand how CAS works, there can only be 3 reasons an ST won’t validate: it is being reused, it has timed out, or it does not exist / is corrupted.

I just can’t find the actual code that validates the ticket and can log the EXACT reason. Can someone point me to the method(s) that do the validation? I just want to add a log.debug message at the point of failure.

Today I found a validation failure that had 2 attempts. I can see when the ST was created, and both attempts failed, so it wasn’t a re-use error?

Other info: version 3.5.2 with Hazelcast ticket registry. I have Hazelcast logging set to debug and see some transfer over port 1501.

Here is a diagram of our infrastructure:

https://www.lucidchart.com/invitations/accept/da009b9d-e55f-4f95-9301-e6bd23d508ab

Yeah, 2 Load Balancers (?). Netscape is really a Sun App Server. Why 2? Because PeopleSoft can’t handle SHA-2 certs on the NetScaler. Yeah, a mess.

Not all failures go through the Sun App Server, but the majority do.

Thanks for any help.

-Bryan

--
You are currently subscribed to cas-dev@lists.jasig.org as: waldb...@lafayette.edu
To unsubscribe, change settings or access archives, see http://www.ja-sig.org/wiki/display/JSG/cas-dev