My issues were definitely related to the virtual threads. Intermittently (frequently), my CAS would fail to start on reboot/restart of the service. Similarly, there were no "deadlocks" reported for me, just threads waiting forever. Like Richard, I found that blocking traffic during startup helped.
Disabling these has completely fixed my issues (knock on wood; I've had about 10 restarts now with no hangs, where before this the chance of a hang was 50% or greater), although I suspect the eager-init setting is unneeded:

spring.cloud.refresh.scope.eager-init=false
spring.threads.virtual.enabled=false

On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:

> Hi Richard,
>
> Thank you for your response! We have made some progress on the diagnostics and have a strong new working theory.
>
> We ran two initial `jstack` thread dumps and confirmed there are no signs of deadlocks among the standard platform threads. However, the system's behavior still strongly suggests a deadlock condition, leading us to suspect the newer virtual threads. We found this article from Netflix highly relevant to our suspicion:
> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>
> Our next step is to use `jcmd` to capture thread dumps in JSON format (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can specifically inspect the status of the virtual threads. We will also capture a heap dump.
>
> We discovered a key correlation that points to the root cause: during the AWS outage on Monday morning (10/20), our CAS service repeatedly became unresponsive every 15-30 minutes. We knew that Instructure (Canvas) was down. Once we switched the Instructure SAML metadata source from the external URL to a local backup copy, the unresponsiveness immediately stopped and has not recurred since.
>
> Based on this evidence, our strong working theory is that the unresponsiveness is directly related to a SAML metadata fetching failure during periods of external network instability, likely causing a virtual thread deadlock.
>
> Thank you for your suggestions, and we will keep you updated once we have analyzed the jcmd and heap dump results.
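For anyone wanting to try the same workaround: both settings above are standard Spring Boot / Spring Cloud properties, so in a typical CAS WAR-overlay deployment they would go into the CAS properties file (commonly `etc/cas/config/cas.properties`; the exact location depends on your overlay). A minimal sketch of that fragment, with my own comments added:

```properties
# Run request handling on classic platform threads instead of
# JDK 21 virtual threads (the suspected source of the hangs).
spring.threads.virtual.enabled=false

# Do not eagerly initialize refresh-scoped beans at startup
# (per the report above, this one may be unnecessary).
spring.cloud.refresh.scope.eager-init=false
```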
> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>
>> If you can, jstack the process when it goes unresponsive. If there is a deadlock, it will tell you where it is.
>>
>> Run it as the same user that the process is running as:
>>
>> jstack <pid>
>>
>> If a deadlock is detected, it will tell you so at the end of the stack.
>>
>> On 10/20/25 10:11, Ocean Liu wrote:
>>
>> Hello Karol,
>>
>> Thank you for confirming that you are seeing this issue on v7.3.0 as well. Unfortunately, we also do not have steps to reproduce it yet.
>>
>> We had two more incidents just this morning, October 20th, around 7:00 AM and 8:00 AM PDT. We have a current hypothesis that we are investigating: we are wondering if these CAS issues might be related to the widely reported AWS issues that occurred this morning, potentially impacting the availability of our service providers' SAML metadata.
>>
>> Have you noticed any correlation between your incidents and any external cloud service provider outages?
>>
>> Thanks again for sharing!
>>
>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>
>>> Hello,
>>>
>>> We have the same issue on 7.3.0. Unfortunately, I don't know how to reproduce it or what is causing it.
>>>
>>> On Tuesday, October 14, 2025 at 23:17:22 UTC+2 Ocean Liu wrote:
>>>
>>>> Hi Richard and Pascal,
>>>>
>>>> Thank you for the help! We will explore the external Tomcat option.
>>>>
>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux wrote:
>>>>
>>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>>
>>>>> > Has anyone encountered this specific behavior, particularly the need to block inbound traffic to achieve a successful restart? Any shared experiences or guidance would be greatly appreciated.
>>>>>
>>>>> On this subject, see the message "Deadlock on startup":
>>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>>
>>>>> We switched from internal Tomcat to external Tomcat and this issue is gone :-)
>>>>>
>>>>> cu

--
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/a/apereo.org/d/msgid/cas-user/ddfbe9f7-e261-49ec-a326-75dcdc8d6220n%40apereo.org.
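A note on why the jstack dumps in this thread came back clean even though the service was stuck: the JVM's built-in deadlock detection (the summary jstack prints at the end of a dump, and the `ThreadMXBean.findDeadlockedThreads()` API behind it) only examines platform threads, so a lock held by a blocked or pinned virtual thread will not be reported; that is why `jcmd <pid> Thread.dump_to_file -format=json` is the right tool for virtual threads. As a hedged illustration (class name `DeadlockCheck` is mine, not from the thread), the same check jstack performs can be run from inside the JVM with the standard management API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        // Same detection jstack reports at the end of a dump:
        // cycles of platform threads blocked on monitors/ownable synchronizers.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // returns null when no deadlock
        if (ids == null) {
            System.out.println("No platform-thread deadlock detected");
        } else {
            // Print the stack of each thread involved in the cycle.
            for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
                System.out.println(info);
            }
        }
        // Caveat: like jstack, this sees only platform threads, so a hang
        // caused by virtual threads can still show "no deadlock" here.
    }
}
```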
