Interesting, that sounds very different from what we had in our dumps,
which all seemed related to beans, but it gives me something else to check
into. We already had SLO off. I was just using jstack, but I guess I
should learn jcmd as well.

For reference, our dumps showed threads stuck in frames like:

at org.springframework.cloud.context.scope.GenericScope$BeanLifecycleWrapper.getBean(GenericScope.java:373)
- locked <0x00000006c8588aa8> (a java.lang.String)
at org.springframework.cloud.context.scope.GenericScope.get(GenericScope.java:177)
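For anyone else just picking up jcmd: once you have a JSON dump from `jcmd <pid> Thread.dump_to_file -format=json <file>`, a small script can flag threads parked in suspicious frames. This is only a sketch; the field names (`threadDump`, `threadContainers`, `threads`, `stack`) are assumptions based on JDK 21 output and should be checked against what your JDK actually emits.

```python
import json

# Synthetic sample shaped like a JDK 21 `Thread.dump_to_file -format=json`
# dump. The schema here is an assumption; verify against your JDK's output.
SAMPLE = """
{
  "threadDump": {
    "processId": "1234",
    "threadContainers": [
      {
        "container": "ForkJoinPool-1",
        "threads": [
          {"tid": "52", "name": "", "stack": ["java.lang.VirtualThread.park(...)"]},
          {"tid": "53", "name": "", "stack": ["java.lang.Thread.sleep(...)"]}
        ]
      }
    ]
  }
}
"""

def threads_matching(dump_json: str, needle: str):
    """Return (tid, top frame) for every thread whose stack mentions `needle`."""
    dump = json.loads(dump_json)
    hits = []
    for container in dump["threadDump"]["threadContainers"]:
        for thread in container.get("threads", []):
            if any(needle in frame for frame in thread.get("stack", [])):
                hits.append((thread["tid"], thread["stack"][0]))
    return hits

# Flag threads parked inside virtual-thread machinery:
print(threads_matching(SAMPLE, "VirtualThread.park"))
```

The same filter works for any frame substring, e.g. `GenericScope.get`, to see how many threads pile up behind the same lock.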
On Tuesday, November 11, 2025 at 10:40:04 PM UTC-5 Ocean Liu wrote:
> Thanks for sharing your experience, Derek!
>
> We did consider disabling Virtual Threads but initially held off due to
> performance concerns.
> We are now confident we've found the root cause without having to revert
> that feature.
>
> Working with our Unicon consultant and analyzing jcmd thread dumps (which
> include Virtual Thread status), we determined the core issue was a Virtual
> Thread deadlock triggered during SAML SP metadata fetching as part of the
> Single Logout (SLO) process.
>
> By default, CAS enables SLO and aggressively fetches SAML SP metadata from
> external URLs without using the local cache.
>
> We implemented the following changes:
> - SLO Disabled: We globally disabled Single Logout.
> - Metadata Cache Priority: We configured CAS to prioritize and utilize the
> local metadata cache.
> - Targeted Local Files: We manually moved several critical SAML SP
> metadata URLs (like RStudio) to local files.
>
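As a rough sketch of the first change: CAS exposes a property to turn SLO off globally. The key below is taken from the CAS configuration documentation, but verify it against your CAS version; the per-service metadata moves would be made in each SAML service definition (its `metadataLocation` pointing at a local file rather than a URL).

```properties
# Globally disable Single Logout (check this key against your CAS version's docs)
cas.slo.disabled=true
```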
> These steps have kept our CAS service stable since implementation.
>
> We also monitored the `CLOSE_WAIT` TCP sockets on our server, which
> provided a key metric for success:
> - Before Changes: We saw spikes of 40–60 `CLOSE_WAIT` TCP sockets
> coinciding with SSO session timeouts.
> - After Changes: The count is consistently low, hovering around 2
> `CLOSE_WAIT` TCP sockets.
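To put a number on that `CLOSE_WAIT` metric: `ss -t state close-wait` shows those sockets from the shell (assuming iproute2 is installed), or on Linux you can parse `/proc/net/tcp` directly. A minimal sketch, demonstrated on a synthetic sample:

```python
# Count CLOSE_WAIT sockets by parsing /proc/net/tcp (Linux only).
# The kernel encodes the TCP state as hex in the "st" column; 08 is CLOSE_WAIT.

def count_close_wait(proc_net_tcp_text: str) -> int:
    """Count rows of /proc/net/tcp-format text whose state column is 08 (CLOSE_WAIT)."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) > 3 and fields[3] == "08":
            count += 1
    return count

# In production: count_close_wait(open("/proc/net/tcp").read())
sample = (
    "  sl  local_address rem_address   st tx_queue rx_queue\n"
    "   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000\n"
    "   1: 0A00020F:D2A4 22C3F30D:01BB 08 00000000:00000000\n"
)
print(count_close_wait(sample))  # → 1
```

Polling this on a timer alongside SSO session timeouts is an easy way to reproduce the before/after comparison above.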
>
> We hope this helps.
> On Tuesday, November 11, 2025 at 1:37:24 PM UTC-8 Derek Badge wrote:
>
>> My issues were definitely related to the virtual threads. Intermittently
>> (frequently) my CAS would fail to start on reboot/restart of service.
>> Similarly, there were no "deadlocks" for me, just threads waiting forever.
>> Like Richard, I found it helped to block traffic during startup.
>>
>> Disabling these has completely fixed my issues (knock on wood; I've had
>> about 10 restarts now with no hangs, where before it was a 50% or greater
>> chance of hanging), although I suspect the eager setting is unnecessary:
>> spring.cloud.refresh.scope.eager-init=false
>> spring.threads.virtual.enabled=false
>>
>> On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:
>>
>>> Hi Richard,
>>>
>>> Thank you for your response! We have made some progress on the
>>> diagnostics and have a strong new working theory.
>>>
>>> We ran two initial `jstack` thread dumps and confirmed there are no
>>> signs of deadlocks among the standard platform threads.
>>> However, the system's behavior still strongly suggests a deadlock
>>> condition, leading us to suspect the newer virtual threads.
>>> We found this article from Netflix highly relevant to our suspicion:
>>> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>>> Our next step is to use `jcmd` to capture thread dumps in JSON format
>>> (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can
>>> specifically inspect the status of the virtual threads.
>>> We will also capture a heap dump.
>>>
>>> We discovered a key correlation that points to the root cause:
>>> During the AWS outage on Monday morning (10/20), our CAS service
>>> repeatedly became unresponsive every 15-30 minutes. We knew that
>>> Instructure (Canvas) was down.
>>> Once we switched the Instructure SAML metadata source from the external
>>> URL to a local backup copy, the unresponsiveness immediately stopped and
>>> has not recurred since.
>>>
>>> Based on this evidence, our strong working theory is that the
>>> unresponsiveness is directly related to a SAML metadata fetching failure
>>> during periods of external network instability, likely causing a virtual
>>> thread deadlock.
>>>
>>> Thank you for your suggestions, and we will keep you updated once we
>>> have analyzed the jcmd and heap dump results.
>>> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>>>
>>>> If you can, jstack the process when it goes unresponsive. If there is a
>>>> deadlock, it will tell you where it is.
>>>>
>>>> As the same user that it is running as
>>>>
>>>> jstack <pid>
>>>>
>>>> If there is a deadlock detected, it will tell you so at the end of the
>>>> stack.
>>>> On 10/20/25 10:11, Ocean Liu wrote:
>>>>
>>>> Hello Karol,
>>>>
>>>> Thank you for confirming that you are seeing this issue on v7.3.0 as
>>>> well. Unfortunately, we also do not have steps to reproduce it yet.
>>>>
>>>> We had two more incidents just this morning, October 20th, around 7:00
>>>> AM and 8:00 AM PDT.
>>>> Our current working hypothesis is that these CAS issues might be
>>>> related to the widely reported AWS issues that occurred this morning,
>>>> potentially impacting the availability of our service providers' SAML
>>>> metadata.
>>>>
>>>> Have you noticed any correlation between your incidents and any
>>>> external cloud service provider outages?
>>>>
>>>> Thanks again for sharing!
>>>>
>>>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We have the same issue on 7.3.0. Unfortunately, I don't know how to
>>>>> reproduce it or what is causing it.
>>>>>
>>>>> On Tuesday, October 14, 2025 at 11:17:22 PM UTC+2, Ocean Liu wrote:
>>>>>
>>>>>> Hi Richard and Pascal,
>>>>>>
>>>>>> Thank you for the help! We will explore the external tomcat option.
>>>>>>
>>>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux wrote:
>>>>>>
>>>>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>>>>
>>>>>>> > Has anyone encountered this specific behavior, particularly the
>>>>>>> > need to block inbound traffic to achieve a successful restart? Any
>>>>>>> > shared experiences or guidance would be greatly appreciated.
>>>>>>>
>>>>>>> On this subject, see msg "Deadlock on startup"
>>>>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>>>>
>>>>>>> We switched from the embedded Tomcat to an external Tomcat and this
>>>>>>> issue is gone :-)
>>>>>>>
>>>>>>> cu
>>>>>>>
>>>>>>
--
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS
Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion visit
https://groups.google.com/a/apereo.org/d/msgid/cas-user/fb3952ff-bd26-4cd4-9674-d319e61a7670n%40apereo.org.