Hi Richard,

Thank you for your response! We have made progress on the diagnostics 
and now have a strong working theory.

We ran two initial `jstack` thread dumps and confirmed there are no signs 
of deadlocks among the standard platform threads.
However, the system's behavior still strongly suggests a deadlock 
condition, leading us to suspect the newer virtual threads, which a 
standard `jstack` dump does not include.
We found this article from Netflix highly relevant to our suspicion: 
https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
Our next step is to use `jcmd` to capture thread dumps in JSON format 
(`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can 
specifically inspect the status of the virtual threads.
We will also capture a heap dump.
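
As a first pass over that JSON, we plan to script something like the 
sketch below. This is only an illustration: the embedded payload is a 
hand-written stand-in for real `jcmd` output (we have not yet confirmed 
the exact schema, so the field names `threadDump`, `threadContainers`, 
`threads`, and `stack` are assumptions), and `SamlMetadataResolver` is a 
made-up frame name.

```python
import json

# Hand-written stand-in for `jcmd <pid> Thread.dump_to_file -format=json`
# output; field names are assumptions and the real schema may differ.
SAMPLE = """
{
  "threadDump": {
    "threadContainers": [
      {
        "container": "ForkJoinPool-1/jdk.internal.vm.SharedThreadContainer",
        "threads": [
          {"tid": "52", "name": "", "stack": [
            "java.base/jdk.internal.misc.Unsafe.park(Native Method)",
            "java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)",
            "org.example.SamlMetadataResolver.resolve(SamlMetadataResolver.java:88)"
          ]},
          {"tid": "53", "name": "", "stack": [
            "java.base/java.lang.VirtualThread.run(VirtualThread.java:311)"
          ]}
        ]
      }
    ]
  }
}
"""

def threads_waiting_on(dump, needle="ReentrantLock.lock"):
    """Return (container, tid) for every thread whose stack mentions `needle`."""
    hits = []
    for tc in dump["threadDump"]["threadContainers"]:
        for t in tc.get("threads", []):
            if any(needle in frame for frame in t.get("stack", [])):
                hits.append((tc["container"], t["tid"]))
    return hits

dump = json.loads(SAMPLE)
print(threads_waiting_on(dump))
```

If many virtual threads turn out to share the same lock frame (here the 
hypothetical `SamlMetadataResolver`), that would support the theory.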

We discovered a key correlation that points to the root cause:
During the AWS outage on Monday morning (10/20), our CAS service became 
unresponsive every 15-30 minutes. Instructure (Canvas), whose SAML 
metadata we fetch from an external URL, was down at the same time.
Once we switched the Instructure SAML metadata source from the external URL 
to a local backup copy, the unresponsiveness immediately stopped and has 
not recurred since.

Based on this evidence, our working theory is that the unresponsiveness 
is directly tied to SAML metadata fetch failures during periods of 
external network instability, most likely culminating in a virtual 
thread deadlock.
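
To make the suspected mechanism concrete, here is a toy model written in 
Python purely as an analogy (the real system is a JVM, and none of these 
names come from CAS): a fixed pool of workers stands in for the carrier 
threads, and tasks that block while occupying a worker stand in for 
virtual threads pinned during a metadata fetch. When every carrier is 
tied up by a stalled fetch, unrelated requests starve.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

CARRIERS = 2  # stand-in for the JVM's small carrier-thread pool

def run_demo():
    outage_over = threading.Event()
    pool = ThreadPoolExecutor(max_workers=CARRIERS)
    # Each "metadata fetch" blocks on the down endpoint while occupying a
    # carrier, so no carrier is ever freed for other work.
    fetches = [pool.submit(outage_over.wait) for _ in range(CARRIERS)]
    login = pool.submit(lambda: "ticket granted")  # an ordinary CAS request
    starved = False
    try:
        login.result(timeout=0.2)  # cannot start: every carrier is occupied
    except FutureTimeout:
        starved = True
    outage_over.set()  # the outage ends (or we switch to local metadata)
    result = login.result(timeout=5)
    pool.shutdown()
    return starved, result

print(run_demo())  # → (True, 'ticket granted')
```

In the real JVM, the analogue would be virtual threads blocking inside a 
`synchronized` region, as the Netflix article describes, which pins 
their carrier threads and exhausts the scheduler.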

Thank you for your suggestions; we will keep you updated once we have 
analyzed the `jcmd` and heap dump results.
On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:

> If you can, jstack the process when it goes unresponsive. If there is a 
> deadlock, it will tell you where it is.
>
> As the same user that it is running as 
>
> jstack <pid>
>
> If there is a deadlock detected, it will tell you so at the end of the 
> stack.
> On 10/20/25 10:11, Ocean Liu wrote:
>
> Hello Karol,
>
> Thank you for confirming that you are seeing this issue on v7.3.0 as well. 
> Unfortunately, we also do not have steps to reproduce it yet.
>
> We had two more incidents just this morning, October 20th, around 7:00 AM 
> and 8:00 AM PDT.
> We have a current hypothesis that we are investigating: we are wondering 
> if these CAS issues might be related to the widely reported AWS issues that 
> occurred this morning, potentially impacting the availability of our 
> service providers' SAML metadata.
>
> Have you noticed any correlation between your incidents and any external 
> cloud service provider outages?
>
> Thanks again for sharing! 
>
> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>
>> Hello, 
>>
>> we have the same issue on 7.3.0. Unfortunately I don't know how to 
>> reproduce it or what is causing it.
>>
>> On Tuesday, October 14, 2025 at 11:17:22 PM UTC+2 Ocean Liu wrote:
>>
>>> Hi Richard and Pascal, 
>>>
>>> Thank you for the help! We will explore the external tomcat option.
>>>
>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux wrote:
>>>
>>>> On 14/10/2025 01:00, Ocean Liu wrote: 
>>>>
>>>> > Has anyone encountered this specific behavior, particularly the need 
>>>> to block inbound traffic to achieve a successful restart? Any shared 
>>>> experiences or guidance would be greatly appreciated. 
>>>>
>>>> On this subject, see msg "Deadlock on startup" 
>>>> https://www.mail-archive.com/[email protected]/msg17421.html 
>>>>
>>>> We switched from internal tomcat to external tomcat and this issue is 
>>>> gone :-) 
>>>>
>>>> cu 
>>>>
>>>

-- 
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
--- 
You received this message because you are subscribed to the Google Groups "CAS 
Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/a/apereo.org/d/msgid/cas-user/0b5b315b-17ea-4d1e-b38f-0bd7231c10c7n%40apereo.org.
