Only during restarts in our case. The only outage we have had while the service was running was /var/run filling to 100% with logs after a very long period of uptime.
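For anyone wanting to catch the /var/run issue above before it causes an outage, a small check can be scripted; a sketch, where the 90% threshold is an arbitrary choice:

```shell
# Check usage of the filesystem backing /var/run (usually a tmpfs, with
# /var/run symlinked to /run on modern distros) and warn when it crosses
# a threshold. The 90% cutoff is arbitrary; tune it to taste.
usage=$(df -P /var/run | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
if [ "$usage" -ge 90 ]; then
    echo "WARNING: /var/run is ${usage}% full"
else
    echo "/var/run usage OK at ${usage}%"
fi
```

Dropping this into cron or a monitoring agent gives early warning well before the filesystem hits 100%.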
On Thu, Nov 13, 2025 at 12:59 PM Ocean Liu <[email protected]> wrote:

> That's interesting! Glad you figured it out!
>
> I am curious, did you only experience the hanging during CAS restarts, or
> was there also unresponsiveness while CAS was already running?
>
> On Thu, Nov 13, 2025 at 8:08 AM Derek Badge <[email protected]> wrote:
>
>> Thanks for the help, it definitely put me on the right track. I went back
>> and re-enabled Virtual Threads, and (conveniently, in this case) the
>> service immediately failed to start.
>>
>> I had a large number of CLOSE_WAIT connections on 8443 from our load
>> balancer, but what I missed earlier was that they were all IPv6 addresses.
>> Since we don't actively use IPv6, this led me to suspect a network stack
>> conflict.
>>
>> I added -Djava.net.preferIPv4Stack=true to the cas.service systemd unit,
>> and that most likely has resolved the issue. The service is now starting
>> reliably (at least on the test servers, after 10 or so restarts).
>>
>> This also explains our previous workaround: blocking port 8443 with the
>> firewall was preventing the load balancer's IPv6-mapped connections from
>> hitting the service during the race-sensitive startup, which is why it
>> worked.
>>
>> It seems the other paths we were investigating were likely red herrings.
>>
>> On Tuesday, November 11, 2025 at 10:40:04 PM UTC-5 Ocean Liu wrote:
>>
>>> Thanks for sharing your experience, Derek!
>>>
>>> We did consider disabling Virtual Threads but initially held off due to
>>> performance concerns. We are now confident we've found the root cause
>>> without having to revert that feature.
>>>
>>> Working with our Unicon consultant and analyzing jcmd thread dumps
>>> (which include Virtual Thread status), we determined the core issue was
>>> a Virtual Thread deadlock triggered during SAML SP metadata fetching as
>>> part of the Single Logout (SLO) process.
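One way to apply Derek's -Djava.net.preferIPv4Stack=true change above without editing the packaged unit file is a systemd drop-in. A sketch, with assumptions: the unit is named cas.service, and its start command honors a JAVA_OPTS-style environment variable (if yours embeds the java command line directly, append the flag to ExecStart instead):

```ini
# /etc/systemd/system/cas.service.d/override.conf  (illustrative path)
# Assumes the unit's ExecStart expands $JAVA_OPTS; adjust to match how
# your cas.service actually passes JVM options.
[Service]
Environment="JAVA_OPTS=-Djava.net.preferIPv4Stack=true"
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service for it to take effect.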
>>>
>>> By default, CAS enables SLO and aggressively fetches SAML SP metadata
>>> from external URLs without using the local cache.
>>>
>>> We implemented the following changes:
>>> - SLO disabled: we globally disabled Single Logout.
>>> - Metadata cache priority: we configured CAS to prioritize and use the
>>>   local metadata cache.
>>> - Targeted local files: we manually moved several critical SAML SP
>>>   metadata URLs (like RStudio) to local files.
>>>
>>> These steps have kept our CAS service stable since implementation.
>>>
>>> We also monitored the `CLOSE_WAIT` TCP sockets on our server, which
>>> provided a key metric for success:
>>> - Before the changes: spikes of 40–60 `CLOSE_WAIT` sockets coinciding
>>>   with SSO session timeouts.
>>> - After the changes: consistently low, hovering around 2 `CLOSE_WAIT`
>>>   sockets.
>>>
>>> We hope this helps.
>>>
>>> On Tuesday, November 11, 2025 at 1:37:24 PM UTC-8 Derek Badge wrote:
>>>
>>>> My issues were definitely related to virtual threads. Intermittently
>>>> (frequently), my CAS would fail to start on a reboot or restart of the
>>>> service. There were no reported "deadlocks" for me either, just threads
>>>> waiting forever. Like Richard, blocking traffic during startup helped.
>>>>
>>>> Disabling virtual threads has completely fixed my issues (knock on
>>>> wood: about 10 restarts now with no hangs, versus a 50% or greater
>>>> failure chance before this), although I suspect the eager setting is
>>>> unneeded:
>>>>
>>>> spring.cloud.refresh.scope.eager-init=false
>>>> spring.threads.virtual.enabled=false
>>>>
>>>> On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:
>>>>
>>>>> Hi Richard,
>>>>>
>>>>> Thank you for your response! We have made some progress on the
>>>>> diagnostics and have a strong new working theory.
>>>>>
>>>>> We ran two initial `jstack` thread dumps and confirmed there are no
>>>>> signs of deadlocks among the standard platform threads.
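The `CLOSE_WAIT` counts described above are easy to track from cron or a monitoring agent. A minimal sketch that reads the kernel socket tables directly (port 8443 is assumed, matching this thread):

```shell
# Count CLOSE_WAIT sockets on a local port by scanning /proc/net/tcp and
# /proc/net/tcp6. In those tables, field 2 is local_address ("ADDR:PORT"
# in hex) and field 4 is the state code; CLOSE_WAIT is 08.
port_hex=$(printf '%04X' "${PORT:-8443}")
count=0
for f in /proc/net/tcp /proc/net/tcp6; do
    [ -r "$f" ] || continue
    n=$(awk -v port="$port_hex" \
        '$4 == "08" { split($2, a, ":"); if (a[2] == port) c++ } END { print c + 0 }' "$f")
    count=$((count + n))
done
echo "CLOSE_WAIT sockets on port ${PORT:-8443}: $count"
```

Interactively, `ss -tan state close-wait '( sport = :8443 )'` shows the same sockets with their peer addresses, which is how the all-IPv6 pattern in this thread would show up.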
>>>>> However, the system's behavior still strongly suggests a deadlock
>>>>> condition, leading us to suspect the newer virtual threads. We found
>>>>> this article from Netflix highly relevant to our suspicion:
>>>>> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>>>>>
>>>>> Our next step is to use `jcmd` to capture thread dumps in JSON format
>>>>> (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can
>>>>> specifically inspect the status of the virtual threads. We will also
>>>>> capture a heap dump.
>>>>>
>>>>> We discovered a key correlation that points to the root cause: during
>>>>> the AWS outage on Monday morning (10/20), our CAS service repeatedly
>>>>> became unresponsive every 15-30 minutes. We knew that Instructure
>>>>> (Canvas) was down. Once we switched the Instructure SAML metadata
>>>>> source from the external URL to a local backup copy, the
>>>>> unresponsiveness immediately stopped and has not recurred since.
>>>>>
>>>>> Based on this evidence, our working theory is that the unresponsiveness
>>>>> is directly tied to SAML metadata fetching failures during periods of
>>>>> external network instability, likely causing a virtual thread deadlock.
>>>>>
>>>>> Thank you for your suggestions; we will keep you updated once we have
>>>>> analyzed the jcmd and heap dump results.
>>>>>
>>>>> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>>>>>
>>>>>> If you can, jstack the process when it goes unresponsive. If there is
>>>>>> a deadlock, it will tell you where it is. Run it as the same user the
>>>>>> process runs as:
>>>>>>
>>>>>> jstack <pid>
>>>>>>
>>>>>> If a deadlock is detected, it will say so at the end of the stack.
>>>>>>
>>>>>> On 10/20/25 10:11, Ocean Liu wrote:
>>>>>>
>>>>>> Hello Karol,
>>>>>>
>>>>>> Thank you for confirming that you are seeing this issue on v7.3.0 as
>>>>>> well.
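The `jcmd` capture steps above can be wrapped in a small script so both dumps are taken in one go when the service hangs. A sketch; the pgrep pattern 'org.apereo.cas' and the /tmp output paths are assumptions to adjust for your deployment:

```shell
# Capture a virtual-thread-aware thread dump (JSON) plus a heap dump from
# a running CAS JVM. jcmd's JSON dump (JDK 21+) includes virtual thread
# state, which plain jstack omits. Pattern and output paths are illustrative.
PID=$(pgrep -f 'org.apereo.cas' 2>/dev/null | head -n 1)
if [ -n "$PID" ] && command -v jcmd >/dev/null 2>&1; then
    jcmd "$PID" Thread.dump_to_file -format=json "/tmp/cas-threads-$PID.json"
    jcmd "$PID" GC.heap_dump "/tmp/cas-heap-$PID.hprof"
    status="captured dumps for pid $PID under /tmp"
else
    status="no CAS JVM (or no jcmd) found; nothing captured"
fi
echo "$status"
```

Run it as the same user the CAS process runs as, just as with jstack.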
>>>>>> Unfortunately, we also do not have steps to reproduce it yet.
>>>>>>
>>>>>> We had two more incidents just this morning, October 20th, around
>>>>>> 7:00 AM and 8:00 AM PDT. Our current hypothesis is that these CAS
>>>>>> issues may be related to the widely reported AWS issues that occurred
>>>>>> this morning, potentially impacting the availability of our service
>>>>>> providers' SAML metadata.
>>>>>>
>>>>>> Have you noticed any correlation between your incidents and any
>>>>>> external cloud service provider outages?
>>>>>>
>>>>>> Thanks again for sharing!
>>>>>>
>>>>>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We have the same issue on 7.3.0. Unfortunately, I don't know how to
>>>>>>> reproduce it or what is causing it.
>>>>>>>
>>>>>>> On Tuesday, October 14, 2025 at 23:17:22 UTC+2, Ocean Liu wrote:
>>>>>>>
>>>>>>>> Hi Richard and Pascal,
>>>>>>>>
>>>>>>>> Thank you for the help! We will explore the external Tomcat option.
>>>>>>>>
>>>>>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>>>>>>
>>>>>>>>> > Has anyone encountered this specific behavior, particularly the
>>>>>>>>> > need to block inbound traffic to achieve a successful restart?
>>>>>>>>> > Any shared experiences or guidance would be greatly appreciated.
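For reference, switching a service provider's metadata source to a local file, as described earlier in the thread for Instructure, is done in the CAS SAML service definition. A minimal sketch; the entity ID, name, numeric id, and file path are all placeholders:

```json
{
  "@class" : "org.apereo.cas.support.saml.services.SamlRegisteredService",
  "serviceId" : "https://sp.example.edu/shibboleth",
  "name" : "ExampleSP",
  "id" : 1001,
  "metadataLocation" : "file:/etc/cas/saml/example-sp-metadata.xml"
}
```

With a file: metadataLocation, CAS no longer needs to reach the SP's external metadata URL at runtime, which is what removed the external-network dependency here.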
>>>>>>>>>
>>>>>>>>> On this subject, see the message "Deadlock on startup":
>>>>>>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>>>>>>
>>>>>>>>> We switched from the internal Tomcat to an external Tomcat and
>>>>>>>>> this issue is gone :-)
>>>>>>>>>
>>>>>>>>> cu
>
> --
> Ocean Liu | Enterprise Web Developer | Whitman College
> WCTS Building 105F - 509.527.4973

--
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/a/apereo.org/d/msgid/cas-user/CADvUoW3EHR%2BGAcvSqPjfBRwU9H896z6k85kO4rvgEJwJ%3D7EzDw%40mail.gmail.com.
