That's interesting! Glad you figured it out! I am curious, did you only experience the hanging during CAS restarts, or was there also unresponsiveness while CAS was already running?
On Thu, Nov 13, 2025 at 8:08 AM Derek Badge <[email protected]> wrote:

> Thanks for the help, it definitely put me on the right track. I went back
> and re-enabled Virtual Threads, and (conveniently, in this case) the
> service immediately failed to start.
>
> I had a large number of CLOSE_WAIT connections on 8443 from our load
> balancer, but what I missed earlier was that they were all IPv6 addresses.
> Since we don't actively use IPv6, this led me to suspect a network stack
> conflict.
>
> I added -Djava.net.preferIPv4Stack=true to the cas.service systemd unit,
> and that has most likely resolved the issue. The service is now starting
> reliably (at least on the test servers, after 10 or so restarts).
>
> This also explains our previous workaround: blocking port 8443 with the
> firewall was preventing the load balancer's IPv6-mapped connections from
> hitting the service during the race-sensitive startup, which is why it
> worked.
>
> It seems the other paths we were investigating were likely red herrings.
>
> On Tuesday, November 11, 2025 at 10:40:04 PM UTC-5 Ocean Liu wrote:
>
>> Thanks for sharing your experience, Derek!
>>
>> We did consider disabling Virtual Threads but initially held off due to
>> performance concerns. We are now confident we've found the root cause
>> without having to revert that feature.
>>
>> Working with our Unicon consultant and analyzing jcmd thread dumps (which
>> include Virtual Thread status), we determined the core issue was a
>> Virtual Thread deadlock triggered during SAML SP metadata fetching as
>> part of the Single Logout (SLO) process.
>>
>> By default, CAS enables SLO and aggressively fetches SAML SP metadata
>> from external URLs without using the local cache.
>>
>> We implemented the following changes:
>> - SLO disabled: we globally disabled Single Logout.
>> - Metadata cache priority: we configured CAS to prioritize and utilize
>>   the local metadata cache.
>> - Targeted local files: we manually moved several critical SAML SP
>>   metadata URLs (like RStudio) to local files.
>>
>> These steps have kept our CAS service stable since implementation.
>>
>> We also monitored the `CLOSE_WAIT` TCP sockets on our server, which
>> provided a key metric for success:
>> - Before the changes: we saw spikes of 40–60 `CLOSE_WAIT` TCP sockets
>>   coinciding with SSO session timeouts.
>> - After the changes: the count is consistently low, hovering around 2
>>   CLOSE_WAIT TCP sockets.
>>
>> We hope this helps.
>>
>> On Tuesday, November 11, 2025 at 1:37:24 PM UTC-8 Derek Badge wrote:
>>
>>> My issues were definitely related to the virtual threads.
>>> Intermittently (frequently) my CAS would fail to start on a
>>> reboot/restart of the service. Similarly, there were no "deadlocks" for
>>> me, just threads waiting forever. Like Richard, it helped when I
>>> blocked traffic during startup.
>>>
>>> Disabling these has completely fixed my issues (knock on wood; I've had
>>> about 10 restarts now with no hangs, where before the chance of a hang
>>> was 50% or greater), although I suspect the eager setting is unneeded:
>>>
>>> spring.cloud.refresh.scope.eager-init=false
>>> spring.threads.virtual.enabled=false
>>>
>>> On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:
>>>
>>>> Hi Richard,
>>>>
>>>> Thank you for your response! We have made some progress on the
>>>> diagnostics and have a strong new working theory.
>>>>
>>>> We ran two initial `jstack` thread dumps and confirmed there are no
>>>> signs of deadlocks among the standard platform threads.
>>>> However, the system's behavior still strongly suggests a deadlock
>>>> condition, leading us to suspect the newer virtual threads.
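[Editor's note: the `CLOSE_WAIT` counts used as a success metric above are easy to script. A minimal sketch, assuming iproute2's `ss` and the 8443 connector port from this thread; the function name is illustrative:]

```shell
# Count CLOSE-WAIT sockets from piped-in `ss` output.
# `ss` prints one header line, so drop it before counting.
count_close_wait() {
    tail -n +2 | wc -l
}

# Live usage (not run here):
#   ss -tan state close-wait '( sport = :8443 )' | count_close_wait
# A persistently high number suggests the server side is failing to
# close sockets the peer has already shut down.
```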
>>>> We found this article from Netflix highly relevant to our suspicion:
>>>> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>>>>
>>>> Our next step is to use `jcmd` to capture thread dumps in JSON format
>>>> (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can
>>>> specifically inspect the status of the virtual threads.
>>>> We will also capture a heap dump.
>>>>
>>>> We discovered a key correlation that points to the root cause:
>>>> during the AWS outage on Monday morning (10/20), our CAS service
>>>> repeatedly became unresponsive every 15–30 minutes. We knew that
>>>> Instructure (Canvas) was down.
>>>> Once we switched the Instructure SAML metadata source from the
>>>> external URL to a local backup copy, the unresponsiveness immediately
>>>> stopped and has not recurred since.
>>>>
>>>> Based on this evidence, our strong working theory is that the
>>>> unresponsiveness is directly related to SAML metadata fetching
>>>> failures during periods of external network instability, likely
>>>> causing a virtual thread deadlock.
>>>>
>>>> Thank you for your suggestions; we will keep you updated once we have
>>>> analyzed the jcmd and heap dump results.
>>>>
>>>> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>>>>
>>>>> If you can, jstack the process when it goes unresponsive. If there is
>>>>> a deadlock, it will tell you where it is.
>>>>>
>>>>> As the same user that it is running as:
>>>>>
>>>>> jstack <pid>
>>>>>
>>>>> If a deadlock is detected, it will say so at the end of the stack
>>>>> dump.
>>>>>
>>>>> On 10/20/25 10:11, Ocean Liu wrote:
>>>>>
>>>>> Hello Karol,
>>>>>
>>>>> Thank you for confirming that you are seeing this issue on v7.3.0 as
>>>>> well. Unfortunately, we also do not have steps to reproduce it yet.
>>>>>
>>>>> We had two more incidents just this morning, October 20th, around
>>>>> 7:00 AM and 8:00 AM PDT.
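[Editor's note: Richard's jstack advice can be wrapped in a small check. The deadlock banner is the one HotSpot actually prints at the end of a dump; the function name and usage are illustrative:]

```shell
# Check piped-in jstack output for a detected deadlock.
# HotSpot appends "Found one Java-level deadlock:" (or "Found N ...")
# when it detects one. Note this only covers platform threads; stuck
# virtual threads will NOT appear here, which is why the thread moved
# on to JSON dumps via:
#   jcmd <pid> Thread.dump_to_file -format=json <filename>
has_deadlock() {
    grep -q 'Found .* deadlock' -
}

# Live usage (not run here):
#   jstack <pid> | has_deadlock && echo "deadlock detected"
```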
>>>>> We have a current hypothesis that we are investigating: we are
>>>>> wondering if these CAS issues might be related to the widely reported
>>>>> AWS issues that occurred this morning, potentially impacting the
>>>>> availability of our service providers' SAML metadata.
>>>>>
>>>>> Have you noticed any correlation between your incidents and any
>>>>> external cloud service provider outages?
>>>>>
>>>>> Thanks again for sharing!
>>>>>
>>>>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> we have the same issue on 7.3.0. Unfortunately I don't know how to
>>>>>> reproduce it or what is causing it.
>>>>>>
>>>>>> On Tuesday, October 14, 2025 at 11:17:22 PM UTC+2 Ocean Liu wrote:
>>>>>>
>>>>>>> Hi Richard and Pascal,
>>>>>>>
>>>>>>> Thank you for the help! We will explore the external Tomcat option.
>>>>>>>
>>>>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>>>>>
>>>>>>>> > Has anyone encountered this specific behavior, particularly the
>>>>>>>> > need to block inbound traffic to achieve a successful restart?
>>>>>>>> > Any shared experiences or guidance would be greatly appreciated.
>>>>>>>>
>>>>>>>> On this subject, see the message "Deadlock on startup":
>>>>>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>>>>>
>>>>>>>> We switched from internal Tomcat to external Tomcat and this issue
>>>>>>>> is gone :-)
>>>>>>>>
>>>>>>>> cu
>>>>>>>
>>>>>>> --
>>>>>>> Ocean Liu | Enterprise Web Developer | Whitman College
>>>>>>> WCTS Building 105F - 509.527.4973
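[Editor's note: Derek's -Djava.net.preferIPv4Stack=true fix, mentioned near the top of the thread, can be applied without editing cas.service itself via a systemd drop-in. A sketch only: the thread does not show the actual unit file, so the JAVA_OPTS variable is an assumption; adjust to however your unit passes JVM options:]

```shell
# Write a systemd drop-in that sets the IPv4-preference JVM flag.
# The drop-in directory and file name follow systemd conventions;
# JAVA_OPTS is a hypothetical variable your ExecStart must consume.
write_ipv4_dropin() {
    dir="$1"
    mkdir -p "$dir"
    cat > "$dir/10-prefer-ipv4.conf" <<'EOF'
[Service]
Environment="JAVA_OPTS=-Djava.net.preferIPv4Stack=true"
EOF
}

# Live usage (not run here):
#   write_ipv4_dropin /etc/systemd/system/cas.service.d
#   systemctl daemon-reload && systemctl restart cas
```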
