Thanks for the help, it definitely put me on the right track. I went back and re-enabled Virtual Threads, and (conveniently, in this case) the service immediately failed to start.
I had a large number of CLOSE_WAIT connections on 8443 from our load balancer, but what I missed earlier was that they were all IPv6 addresses. Since we don't actively use IPv6, this led me to suspect a network stack conflict.

I added `-Djava.net.preferIPv4Stack=true` to the cas.service systemd unit, and that has most likely resolved the issue. The service is now starting reliably (at least on the test servers, after 10 or so restarts).

This also explains our previous workaround: blocking port 8443 with the firewall was preventing the load balancer's IPv6-mapped connections from hitting the service during the race-sensitive startup, which is why it worked. It seems the other paths we were investigating were likely red herrings.

On Tuesday, November 11, 2025 at 10:40:04 PM UTC-5 Ocean Liu wrote:

> Thanks for sharing your experience, Derek!
>
> We did consider disabling Virtual Threads but initially held off due to
> performance concerns. We are now confident we've found the root cause
> without having to revert that feature.
>
> Working with our Unicon consultant and analyzing jcmd thread dumps (which
> include Virtual Thread status), we determined the core issue was a Virtual
> Thread deadlock triggered during SAML SP metadata fetching as part of the
> Single Logout (SLO) process.
>
> By default, CAS enables SLO and aggressively fetches SAML SP metadata from
> external URLs without using the local cache.
>
> We implemented the following changes:
> - SLO disabled: we globally disabled Single Logout.
> - Metadata cache priority: we configured CAS to prioritize and utilize the
>   local metadata cache.
> - Targeted local files: we manually moved several critical SAML SP
>   metadata URLs (like RStudio) to local files.
>
> These steps have kept our CAS service stable since implementation.
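[Editor's note] A minimal sketch of the systemd change described above, assuming the unit passes JVM options through an environment variable. The drop-in path and the `JAVA_OPTS` variable name are illustrative; match however your actual cas.service launches the JVM.

```ini
# /etc/systemd/system/cas.service.d/override.conf   (illustrative drop-in path)
[Service]
# Restrict the JVM to the IPv4 stack so the 8443 listener is not reached
# via IPv6-mapped addresses during the race-sensitive startup window.
Environment="JAVA_OPTS=-Djava.net.preferIPv4Stack=true"
```

After adding the drop-in, run `systemctl daemon-reload` before restarting the service so systemd picks up the change.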
> We also monitored the `CLOSE_WAIT` TCP sockets on our server, which
> provided a key metric for success:
> - Before changes: we saw spikes of 40–60 `CLOSE_WAIT` TCP sockets
>   coinciding with SSO session timeouts.
> - After changes: the count is consistently low, hovering around 2
>   `CLOSE_WAIT` TCP sockets.
>
> We hope this helps.
>
> On Tuesday, November 11, 2025 at 1:37:24 PM UTC-8 Derek Badge wrote:
>
>> My issues were definitely related to the virtual threads. Intermittently
>> (frequently), my CAS would fail to start on reboot/restart of the service.
>> Similarly, there were no "deadlocks" for me, just threads waiting forever.
>> Like Richard, blocking traffic during startup would help.
>>
>> Disabling these has completely fixed my issues (knock on wood; I've had
>> about 10 restarts now with no hangs, and it was a 50% or greater chance of
>> failure before this), although I suspect the eager setting is unneeded:
>>
>> spring.cloud.refresh.scope.eager-init=false
>> spring.threads.virtual.enabled=false
>>
>> On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:
>>
>>> Hi Richard,
>>>
>>> Thank you for your response! We have made some progress on the
>>> diagnostics and have a strong new working theory.
>>>
>>> We ran two initial `jstack` thread dumps and confirmed there are no
>>> signs of deadlocks among the standard platform threads. However, the
>>> system's behavior still strongly suggests a deadlock condition, leading
>>> us to suspect the newer virtual threads. We found this article from
>>> Netflix highly relevant to our suspicion:
>>> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>>>
>>> Our next step is to use `jcmd` to capture thread dumps in JSON format
>>> (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can
>>> specifically inspect the status of the virtual threads. We will also
>>> capture a heap dump.
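[Editor's note] The failure mode described in the Netflix article can be sketched in a few lines: on JDK 21, a virtual thread that blocks inside a `synchronized` block is "pinned" to its carrier thread, so the carrier cannot be released to run other virtual threads. Enough simultaneously pinned threads (e.g. many metadata fetches stuck on an unreachable host while a monitor is held) can exhaust the small carrier pool and make the whole service appear hung. A toy illustration, not the actual CAS code path:

```java
// Toy illustration (JDK 21+) of virtual-thread pinning. This example
// completes normally; in a real hang, many pinned threads would block
// indefinitely and starve the carrier pool.
public class PinningSketch {
    private static final Object LOCK = new Object();

    static String run() throws InterruptedException {
        Thread vt = Thread.ofVirtual().start(() -> {
            synchronized (LOCK) {        // monitor held: the virtual thread is pinned
                try {
                    Thread.sleep(50);    // blocks while pinned; the carrier is stuck too
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        vt.join();                       // returns here; a real deadlock never would
        return "done";
    }

    public static void main(String[] args) throws InterruptedException {
        // Running the JVM with -Djdk.tracePinnedThreads=full makes it print
        // a stack trace each time a virtual thread pins its carrier.
        System.out.println(run());
    }
}
```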
>>> We discovered a key correlation that points to the root cause:
>>> during the AWS outage on Monday morning (10/20), our CAS service
>>> repeatedly became unresponsive every 15-30 minutes. We knew that
>>> Instructure (Canvas) was down. Once we switched the Instructure SAML
>>> metadata source from the external URL to a local backup copy, the
>>> unresponsiveness immediately stopped and has not recurred since.
>>>
>>> Based on this evidence, our strong working theory is that the
>>> unresponsiveness is directly related to a SAML metadata fetching failure
>>> during periods of external network instability, likely causing a virtual
>>> thread deadlock.
>>>
>>> Thank you for your suggestions, and we will keep you updated once we
>>> have analyzed the jcmd and heap dump results.
>>>
>>> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>>>
>>>> If you can, jstack the process when it goes unresponsive. If there is
>>>> a deadlock, it will tell you where it is.
>>>>
>>>> Run it as the same user the service runs as:
>>>>
>>>> jstack <pid>
>>>>
>>>> If a deadlock is detected, it will say so at the end of the stack.
>>>>
>>>> On 10/20/25 10:11, Ocean Liu wrote:
>>>>
>>>> Hello Karol,
>>>>
>>>> Thank you for confirming that you are seeing this issue on v7.3.0 as
>>>> well. Unfortunately, we also do not have steps to reproduce it yet.
>>>>
>>>> We had two more incidents just this morning, October 20th, around
>>>> 7:00 AM and 8:00 AM PDT. We have a current hypothesis that we are
>>>> investigating: we wonder whether these CAS issues might be related to
>>>> the widely reported AWS issues that occurred this morning, potentially
>>>> impacting the availability of our service providers' SAML metadata.
>>>>
>>>> Have you noticed any correlation between your incidents and any
>>>> external cloud service provider outages?
>>>>
>>>> Thanks again for sharing!
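[Editor's note] Switching a service provider's metadata source from a remote URL to a local backup, as described above, can be expressed in the CAS JSON service registry. This is a hedged sketch: the `serviceId`, `name`, `id`, and file path below are placeholders, not the actual definition used in the thread.

```json
{
  "@class" : "org.apereo.cas.support.saml.services.SamlRegisteredService",
  "serviceId" : "https://example.instructure.com/saml2",
  "name" : "Canvas",
  "id" : 1001,
  "metadataLocation" : "file:/etc/cas/saml/instructure-sp-metadata.xml"
}
```

With a `file:` metadata location, CAS reads the SP metadata from disk instead of fetching it over the network, so an outage at the provider cannot stall the fetch.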
>>>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> we have the same issue on 7.3.0. Unfortunately, I don't know how to
>>>>> reproduce it or what is causing it.
>>>>>
>>>>> On Tuesday, October 14, 2025 at 11:17:22 PM UTC+2 Ocean Liu wrote:
>>>>>
>>>>>> Hi Richard and Pascal,
>>>>>>
>>>>>> Thank you for the help! We will explore the external Tomcat option.
>>>>>>
>>>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux wrote:
>>>>>>
>>>>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>>>>
>>>>>>> > Has anyone encountered this specific behavior, particularly the
>>>>>>> need to block inbound traffic to achieve a successful restart? Any
>>>>>>> shared experiences or guidance would be greatly appreciated.
>>>>>>>
>>>>>>> On this subject, see the message "Deadlock on startup":
>>>>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>>>>
>>>>>>> We switched from internal Tomcat to external Tomcat and this issue
>>>>>>> is gone :-)
>>>>>>>
>>>>>>> cu

--
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/a/apereo.org/d/msgid/cas-user/ba0c8c75-c40f-4bb6-bbec-9da856e97d8en%40apereo.org.
