Only during restarts in our case. The only outage we have had while the service was running was /var/run filling to 100% with logs after a very long period of uptime.
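For anyone wanting to catch the /var/run issue above before it causes an outage, a small check can be scripted; a sketch, where the 90% threshold is an arbitrary choice:

```shell
# Check usage of the filesystem backing /var/run (usually a tmpfs, with
# /var/run symlinked to /run on modern distros) and warn when it crosses
# a threshold. The 90% cutoff is arbitrary; tune it to taste.
usage=$(df -P /var/run | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
if [ "$usage" -ge 90 ]; then
    echo "WARNING: /var/run is ${usage}% full"
else
    echo "/var/run usage OK at ${usage}%"
fi
```

Dropping this into cron or a monitoring agent gives early warning well before the filesystem hits 100%.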
On Thu, Nov 13, 2025 at 12:59 PM Ocean Liu <[email protected]> wrote:

> That's interesting! Glad you figured it out!
>
> I am curious, did you only experience the hanging during CAS restarts, or
> was there also unresponsiveness while CAS was already running?
>
> On Thu, Nov 13, 2025 at 8:08 AM Derek Badge <[email protected]> wrote:
>
>> Thanks for the help, it definitely put me on the right track. I went back
>> and re-enabled Virtual Threads, and (conveniently, in this case) the
>> service immediately failed to start.
>>
>> I had a large number of CLOSE_WAIT connections on 8443 from our load
>> balancer, but what I missed earlier was that they were all IPv6 addresses.
>> Since we don't actively use IPv6, this led me to suspect a network stack
>> conflict.
>>
>> I added -Djava.net.preferIPv4Stack=true to the cas.service systemd unit,
>> and that most likely has resolved the issue. The service is now starting
>> reliably (at least on the test servers, after 10 or so restarts).
>>
>> This also explains our previous workaround: blocking port 8443 with the
>> firewall was preventing the load balancer's IPv6-mapped connections from
>> hitting the service during the race-sensitive startup, which is why it
>> worked.
>>
>> It seems the other paths we were investigating were likely red herrings.
>>
>> On Tuesday, November 11, 2025 at 10:40:04 PM UTC-5 Ocean Liu wrote:
>>
>>> Thanks for sharing your experience, Derek!
>>>
>>> We did consider disabling Virtual Threads but initially held off due to
>>> performance concerns. We are now confident we've found the root cause
>>> without having to revert that feature.
>>>
>>> Working with our Unicon consultant and analyzing jcmd thread dumps
>>> (which include Virtual Thread status), we determined the core issue was
>>> a Virtual Thread deadlock triggered during SAML SP metadata fetching as
>>> part of the Single Logout (SLO) process.
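One way to apply Derek's -Djava.net.preferIPv4Stack=true change above without editing the packaged unit file is a systemd drop-in. A sketch, with assumptions: the unit is named cas.service, and its start command honors a JAVA_OPTS-style environment variable (if yours embeds the java command line directly, append the flag to ExecStart instead):

```ini
# /etc/systemd/system/cas.service.d/override.conf  (illustrative path)
# Assumes the unit's ExecStart expands $JAVA_OPTS; adjust to match how
# your cas.service actually passes JVM options.
[Service]
Environment="JAVA_OPTS=-Djava.net.preferIPv4Stack=true"
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service for it to take effect.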
>>>
>>> By default, CAS enables SLO and aggressively fetches SAML SP metadata
>>> from external URLs without using the local cache.
>>>
>>> We implemented the following changes:
>>> - SLO disabled: we globally disabled Single Logout.
>>> - Metadata cache priority: we configured CAS to prioritize and use the
>>>   local metadata cache.
>>> - Targeted local files: we manually moved several critical SAML SP
>>>   metadata URLs (like RStudio) to local files.
>>>
>>> These steps have kept our CAS service stable since implementation.
>>>
>>> We also monitored the `CLOSE_WAIT` TCP sockets on our server, which
>>> provided a key metric for success:
>>> - Before the changes: spikes of 40–60 `CLOSE_WAIT` sockets coinciding
>>>   with SSO session timeouts.
>>> - After the changes: consistently low, hovering around 2 `CLOSE_WAIT`
>>>   sockets.
>>>
>>> We hope this helps.
>>>
>>> On Tuesday, November 11, 2025 at 1:37:24 PM UTC-8 Derek Badge wrote:
>>>
>>>> My issues were definitely related to virtual threads. Intermittently
>>>> (frequently), my CAS would fail to start on a reboot or restart of the
>>>> service. There were no reported "deadlocks" for me either, just threads
>>>> waiting forever. Like Richard, blocking traffic during startup helped.
>>>>
>>>> Disabling virtual threads has completely fixed my issues (knock on
>>>> wood: about 10 restarts now with no hangs, versus a 50% or greater
>>>> failure chance before this), although I suspect the eager setting is
>>>> unneeded:
>>>>
>>>> spring.cloud.refresh.scope.eager-init=false
>>>> spring.threads.virtual.enabled=false
>>>>
>>>> On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:
>>>>
>>>>> Hi Richard,
>>>>>
>>>>> Thank you for your response! We have made some progress on the
>>>>> diagnostics and have a strong new working theory.
>>>>>
>>>>> We ran two initial `jstack` thread dumps and confirmed there are no
>>>>> signs of deadlocks among the standard platform threads.
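The `CLOSE_WAIT` counts described above are easy to track from cron or a monitoring agent. A minimal sketch that reads the kernel socket tables directly (port 8443 is assumed, matching this thread):

```shell
# Count CLOSE_WAIT sockets on a local port by scanning /proc/net/tcp and
# /proc/net/tcp6. In those tables, field 2 is local_address ("ADDR:PORT"
# in hex) and field 4 is the state code; CLOSE_WAIT is 08.
port_hex=$(printf '%04X' "${PORT:-8443}")
count=0
for f in /proc/net/tcp /proc/net/tcp6; do
    [ -r "$f" ] || continue
    n=$(awk -v port="$port_hex" \
        '$4 == "08" { split($2, a, ":"); if (a[2] == port) c++ } END { print c + 0 }' "$f")
    count=$((count + n))
done
echo "CLOSE_WAIT sockets on port ${PORT:-8443}: $count"
```

Interactively, `ss -tan state close-wait '( sport = :8443 )'` shows the same sockets with their peer addresses, which is how the all-IPv6 pattern in this thread would show up.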
>>>>> However, the system's behavior still strongly suggests a deadlock
>>>>> condition, leading us to suspect the newer virtual threads. We found
>>>>> this article from Netflix highly relevant to our suspicion:
>>>>> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>>>>>
>>>>> Our next step is to use `jcmd` to capture thread dumps in JSON format
>>>>> (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can
>>>>> specifically inspect the status of the virtual threads. We will also
>>>>> capture a heap dump.
>>>>>
>>>>> We discovered a key correlation that points to the root cause: during
>>>>> the AWS outage on Monday morning (10/20), our CAS service repeatedly
>>>>> became unresponsive every 15-30 minutes. We knew that Instructure
>>>>> (Canvas) was down. Once we switched the Instructure SAML metadata
>>>>> source from the external URL to a local backup copy, the
>>>>> unresponsiveness immediately stopped and has not recurred since.
>>>>>
>>>>> Based on this evidence, our working theory is that the unresponsiveness
>>>>> is directly tied to SAML metadata fetching failures during periods of
>>>>> external network instability, likely causing a virtual thread deadlock.
>>>>>
>>>>> Thank you for your suggestions; we will keep you updated once we have
>>>>> analyzed the jcmd and heap dump results.
>>>>>
>>>>> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>>>>>
>>>>>> If you can, jstack the process when it goes unresponsive. If there is
>>>>>> a deadlock, it will tell you where it is. Run it as the same user the
>>>>>> process runs as:
>>>>>>
>>>>>> jstack <pid>
>>>>>>
>>>>>> If a deadlock is detected, it will say so at the end of the stack.
>>>>>>
>>>>>> On 10/20/25 10:11, Ocean Liu wrote:
>>>>>>
>>>>>> Hello Karol,
>>>>>>
>>>>>> Thank you for confirming that you are seeing this issue on v7.3.0 as
>>>>>> well.
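The `jcmd` capture steps above can be wrapped in a small script so both dumps are taken in one go when the service hangs. A sketch; the pgrep pattern 'org.apereo.cas' and the /tmp output paths are assumptions to adjust for your deployment:

```shell
# Capture a virtual-thread-aware thread dump (JSON) plus a heap dump from
# a running CAS JVM. jcmd's JSON dump (JDK 21+) includes virtual thread
# state, which plain jstack omits. Pattern and output paths are illustrative.
PID=$(pgrep -f 'org.apereo.cas' 2>/dev/null | head -n 1)
if [ -n "$PID" ] && command -v jcmd >/dev/null 2>&1; then
    jcmd "$PID" Thread.dump_to_file -format=json "/tmp/cas-threads-$PID.json"
    jcmd "$PID" GC.heap_dump "/tmp/cas-heap-$PID.hprof"
    status="captured dumps for pid $PID under /tmp"
else
    status="no CAS JVM (or no jcmd) found; nothing captured"
fi
echo "$status"
```

Run it as the same user the CAS process runs as, just as with jstack.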
>>>>>> Unfortunately, we also do not have steps to reproduce it yet.
>>>>>>
>>>>>> We had two more incidents just this morning, October 20th, around
>>>>>> 7:00 AM and 8:00 AM PDT. Our current hypothesis is that these CAS
>>>>>> issues may be related to the widely reported AWS issues that occurred
>>>>>> this morning, potentially impacting the availability of our service
>>>>>> providers' SAML metadata.
>>>>>>
>>>>>> Have you noticed any correlation between your incidents and any
>>>>>> external cloud service provider outages?
>>>>>>
>>>>>> Thanks again for sharing!
>>>>>>
>>>>>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We have the same issue on 7.3.0. Unfortunately, I don't know how to
>>>>>>> reproduce it or what is causing it.
>>>>>>>
>>>>>>> On Tuesday, October 14, 2025 at 23:17:22 UTC+2, Ocean Liu wrote:
>>>>>>>
>>>>>>>> Hi Richard and Pascal,
>>>>>>>>
>>>>>>>> Thank you for the help! We will explore the external Tomcat option.
>>>>>>>>
>>>>>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>>>>>>
>>>>>>>>> > Has anyone encountered this specific behavior, particularly the
>>>>>>>>> > need to block inbound traffic to achieve a successful restart?
>>>>>>>>> > Any shared experiences or guidance would be greatly appreciated.
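For reference, switching a service provider's metadata source to a local file, as described earlier in the thread for Instructure, is done in the CAS SAML service definition. A minimal sketch; the entity ID, name, numeric id, and file path are all placeholders:

```json
{
  "@class" : "org.apereo.cas.support.saml.services.SamlRegisteredService",
  "serviceId" : "https://sp.example.edu/shibboleth",
  "name" : "ExampleSP",
  "id" : 1001,
  "metadataLocation" : "file:/etc/cas/saml/example-sp-metadata.xml"
}
```

With a file: metadataLocation, CAS no longer needs to reach the SP's external metadata URL at runtime, which is what removed the external-network dependency here.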
>>>>>>>>>
>>>>>>>>> On this subject, see the message "Deadlock on startup":
>>>>>>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>>>>>>
>>>>>>>>> We switched from the internal Tomcat to an external Tomcat and
>>>>>>>>> this issue is gone :-)
>>>>>>>>>
>>>>>>>>> cu
>
> --
> Ocean Liu | Enterprise Web Developer | Whitman College
> WCTS Building 105F - 509.527.4973

--
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/a/apereo.org/d/msgid/cas-user/CADvUoW3EHR%2BGAcvSqPjfBRwU9H896z6k85kO4rvgEJwJ%3D7EzDw%40mail.gmail.com.
