Interesting, that sounds very different from what we had in our dumps, 
which all seemed related to beans, but it gives me something else to check 
into.  We already had SLO off.  I was just using jstack, but I guess I 
should learn jcmd as well.
        at org.springframework.cloud.context.scope.GenericScope$BeanLifecycleWrapper.getBean(GenericScope.java:373)
        - locked <0x00000006c8588aa8> (a java.lang.String)
        at org.springframework.cloud.context.scope.GenericScope.get(GenericScope.java:177)

On Tuesday, November 11, 2025 at 10:40:04 PM UTC-5 Ocean Liu wrote:

> Thanks for sharing your experience, Derek!
>
> We did consider disabling Virtual Threads but initially held off due to 
> performance concerns.
> We are now confident we've found the root cause without having to revert 
> that feature.
>
> Working with our Unicon consultant and analyzing jcmd thread dumps (which 
> include Virtual Thread status), we determined the core issue was a Virtual 
> Thread deadlock triggered during SAML SP metadata fetching as part of the 
> Single Logout (SLO) process.
>
> By default, CAS enables SLO and aggressively fetches SAML SP metadata from 
> external URLs without using the local cache.
>
> We implemented the following changes:
> - SLO Disabled: We globally disabled Single Logout.
> - Metadata Cache Priority: We configured CAS to prioritize and utilize the 
> local metadata cache.
> - Targeted Local Files: We manually switched several critical SAML SPs 
> (like RStudio) from external metadata URLs to local files.
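> For reference, SLO was turned off globally with `cas.slo.disabled=true`, 
> and the per-service piece looked roughly like the sketch below (the 
> `metadataLocation` field on a `SamlRegisteredService` definition accepts a 
> `file:` URL; the serviceId, name, id, and path here are illustrative):

```json
{
  "@class" : "org.apereo.cas.support.saml.services.SamlRegisteredService",
  "serviceId" : "https://sp.example.edu/shibboleth",
  "name" : "ExampleSP",
  "id" : 100,
  "metadataLocation" : "file:/etc/cas/saml/sp-example-metadata.xml"
}
```

> (Property and field names can shift between CAS versions; check the 
> configuration docs for yours.)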
>
> These steps have kept our CAS service stable since implementation.
>
> We also monitored the `CLOSE_WAIT` TCP sockets on our server, which 
> provided a key metric for success:
> - Before Changes: We saw spikes of 40–60 `CLOSE_WAIT` TCP sockets 
> coinciding with SSO session timeouts.
> - After Changes: The count is consistently low, hovering around 2 
> `CLOSE_WAIT` TCP sockets.
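> A quick way to watch that metric (a sketch using `ss` from iproute2; 
> `tail` drops the header line):

```shell
# Count TCP sockets currently sitting in CLOSE_WAIT.
# -t TCP, -a all, -n numeric; the state filter keeps CLOSE_WAIT sockets only.
ss -tan state close-wait | tail -n +2 | wc -l
```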
>
> We hope this helps.
> On Tuesday, November 11, 2025 at 1:37:24 PM UTC-8 Derek Badge wrote:
>
>> My issues were definitely related to the virtual threads.  Intermittently 
>> (frequently), my CAS would fail to start on reboot/restart of the service.  
>> Similarly, there were no "deadlocks" for me, just threads waiting forever.  
>> Like Richard, I found it helped to block traffic during startup. 
>>
>> Disabling these has completely fixed my issues (knock on wood: I've had 
>> about 10 restarts now with no hangs, where before the chance of a hang was 
>> 50% or greater), although I suspect the eager setting is unneeded.  
>> spring.cloud.refresh.scope.eager-init=false
>> spring.threads.virtual.enabled=false
>>
>> On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:
>>
>>> Hi Richard,
>>>
>>> Thank you for your response! We have made some progress on the 
>>> diagnostics and have a strong new working theory.
>>>
>>> We ran two initial `jstack` thread dumps and confirmed there are no 
>>> signs of deadlocks among the standard platform threads.
>>> However, the system's behavior still strongly suggests a deadlock 
>>> condition, leading us to suspect the newer virtual threads.
>>> We found this article from Netflix highly relevant to our suspicion: 
>>> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>>> Our next step is to use `jcmd` to capture thread dumps in JSON format 
>>> (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can 
>>> specifically inspect the status of the virtual threads.
>>> We will also capture a heap dump.
>>>
>>> We discovered a key correlation that points to the root cause:
>>> During the AWS outage on Monday morning (10/20), our CAS service 
>>> repeatedly became unresponsive every 15–30 minutes. We knew that 
>>> Instructure (Canvas) was down.
>>> Once we switched the Instructure SAML metadata source from the external 
>>> URL to a local backup copy, the unresponsiveness immediately stopped and 
>>> has not recurred since.
>>>
>>> Based on this evidence, our strong working theory is that the 
>>> unresponsiveness is directly related to a SAML metadata fetching failure 
>>> during periods of external network instability, likely causing a virtual 
>>> thread deadlock.
>>>
>>> Thank you for your suggestions, and we will keep you updated once we 
>>> have analyzed the jcmd and heap dump results.
>>> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>>>
>>>> If you can, jstack the process when it goes unresponsive. If there is a 
>>>> deadlock, it will tell you where it is.
>>>>
>>>> Run it as the same user the process is running as: 
>>>>
>>>> jstack <pid>
>>>>
>>>> If a deadlock is detected, it will tell you so at the end of the 
>>>> output.
>>>> On 10/20/25 10:11, Ocean Liu wrote:
>>>>
>>>> Hello Karol,
>>>>
>>>> Thank you for confirming that you are seeing this issue on v7.3.0 as 
>>>> well. Unfortunately, we also do not have steps to reproduce it yet.
>>>>
>>>> We had two more incidents just this morning, October 20th, around 7:00 
>>>> AM and 8:00 AM PDT.
>>>> We have a current hypothesis that we are investigating: we are 
>>>> wondering if these CAS issues might be related to the widely reported AWS 
>>>> issues that occurred this morning, potentially impacting the availability 
>>>> of our service providers' SAML metadata.
>>>>
>>>> Have you noticed any correlation between your incidents and any 
>>>> external cloud service provider outages?
>>>>
>>>> Thanks again for sharing! 
>>>>
>>>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>>>
>>>>> Hello, 
>>>>>
>>>>> We have the same issue on 7.3.0. Unfortunately, I don't know how to 
>>>>> reproduce it or what is causing it.
>>>>>
>>>>> On Tuesday, October 14, 2025 at 11:17:22 PM UTC+2, Ocean Liu wrote:
>>>>>
>>>>>> Hi Richard and Pascal, 
>>>>>>
>>>>>> Thank you for the help! We will explore the external tomcat option.
>>>>>>
>>>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux wrote:
>>>>>>
>>>>>>> On 14/10/2025 01:00, Ocean Liu wrote: 
>>>>>>>
>>>>>>> > Has anyone encountered this specific behavior, particularly the 
>>>>>>> need to block inbound traffic to achieve a successful restart? Any 
>>>>>>> shared 
>>>>>>> experiences or guidance would be greatly appreciated. 
>>>>>>>
>>>>>>> On this subject, see the msg "Deadlock on startup": 
>>>>>>> https://www.mail-archive.com/[email protected]/msg17421.html 
>>>>>>>
>>>>>>> We switched from the embedded Tomcat to an external Tomcat and this 
>>>>>>> issue is gone :-) 
>>>>>>>
>>>>>>> cu 
>>>>>>>
>>>>>>

-- 
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
--- 
You received this message because you are subscribed to the Google Groups "CAS 
Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/a/apereo.org/d/msgid/cas-user/fb3952ff-bd26-4cd4-9674-d319e61a7670n%40apereo.org.
