That's interesting! Glad you figured it out! I am curious, did you only experience the hanging during CAS restarts, or was there also unresponsiveness while CAS was already running?
On Thu, Nov 13, 2025 at 8:08 AM Derek Badge <[email protected]> wrote:

> Thanks for the help, it definitely put me on the right track. I went back
> and re-enabled Virtual Threads, and (conveniently, in this case) the
> service immediately failed to start.
>
> I had a large number of CLOSE_WAIT connections on 8443 from our load
> balancer, but what I missed earlier was that they were all IPv6 addresses.
> Since we don't actively use IPv6, this led me to suspect a network stack
> conflict.
>
> I added -Djava.net.preferIPv4Stack=true to the cas.service systemd unit,
> and that has most likely resolved the issue. The service is now starting
> reliably (at least on the test servers, after 10 or so restarts).
>
> This also explains our previous workaround: blocking port 8443 with the
> firewall was preventing the load balancer's IPv6-mapped connections from
> hitting the service during the race-sensitive startup, which is why it
> worked.
>
> It seems the other paths we were investigating were likely red herrings.
>
> On Tuesday, November 11, 2025 at 10:40:04 PM UTC-5 Ocean Liu wrote:
>
>> Thanks for sharing your experience, Derek!
>>
>> We did consider disabling Virtual Threads but initially held off due to
>> performance concerns. We are now confident we've found the root cause
>> without having to revert that feature.
>>
>> Working with our Unicon consultant and analyzing jcmd thread dumps (which
>> include Virtual Thread status), we determined the core issue was a
>> Virtual Thread deadlock triggered during SAML SP metadata fetching as
>> part of the Single Logout (SLO) process.
>>
>> By default, CAS enables SLO and aggressively fetches SAML SP metadata
>> from external URLs without using the local cache.
>>
>> We implemented the following changes:
>> - SLO disabled: we globally disabled Single Logout.
>> - Metadata cache priority: we configured CAS to prioritize and utilize
>>   the local metadata cache.
>> - Targeted local files: we manually moved several critical SAML SP
>>   metadata URLs (like RStudio) to local files.
>>
>> These steps have kept our CAS service stable since implementation.
>>
>> We also monitored the `CLOSE_WAIT` TCP sockets on our server, which
>> provided a key metric for success:
>> - Before the changes: we saw spikes of 40–60 `CLOSE_WAIT` TCP sockets
>>   coinciding with SSO session timeouts.
>> - After the changes: the count is consistently low, hovering around 2
>>   CLOSE_WAIT TCP sockets.
>>
>> We hope this helps.
>>
>> On Tuesday, November 11, 2025 at 1:37:24 PM UTC-8 Derek Badge wrote:
>>
>>> My issues were definitely related to the virtual threads.
>>> Intermittently (frequently) my CAS would fail to start on a
>>> reboot/restart of the service. Similarly, there were no "deadlocks" for
>>> me, just threads waiting forever. Like Richard, it helped when I
>>> blocked traffic during startup.
>>>
>>> Disabling these has completely fixed my issues (knock on wood; I've had
>>> about 10 restarts now with no hangs, where before the chance of a hang
>>> was 50% or greater), although I suspect the eager setting is unneeded:
>>>
>>> spring.cloud.refresh.scope.eager-init=false
>>> spring.threads.virtual.enabled=false
>>>
>>> On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:
>>>
>>>> Hi Richard,
>>>>
>>>> Thank you for your response! We have made some progress on the
>>>> diagnostics and have a strong new working theory.
>>>>
>>>> We ran two initial `jstack` thread dumps and confirmed there are no
>>>> signs of deadlocks among the standard platform threads.
>>>> However, the system's behavior still strongly suggests a deadlock
>>>> condition, leading us to suspect the newer virtual threads.
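[Editor's note: the `CLOSE_WAIT` counts used as a success metric above are easy to script. A minimal sketch, assuming iproute2's `ss` and the 8443 connector port from this thread; the function name is illustrative:]

```shell
# Count CLOSE-WAIT sockets from piped-in `ss` output.
# `ss` prints one header line, so drop it before counting.
count_close_wait() {
    tail -n +2 | wc -l
}

# Live usage (not run here):
#   ss -tan state close-wait '( sport = :8443 )' | count_close_wait
# A persistently high number suggests the server side is failing to
# close sockets the peer has already shut down.
```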
>>>> We found this article from Netflix highly relevant to our suspicion:
>>>> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>>>>
>>>> Our next step is to use `jcmd` to capture thread dumps in JSON format
>>>> (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can
>>>> specifically inspect the status of the virtual threads.
>>>> We will also capture a heap dump.
>>>>
>>>> We discovered a key correlation that points to the root cause:
>>>> during the AWS outage on Monday morning (10/20), our CAS service
>>>> repeatedly became unresponsive every 15–30 minutes. We knew that
>>>> Instructure (Canvas) was down.
>>>> Once we switched the Instructure SAML metadata source from the
>>>> external URL to a local backup copy, the unresponsiveness immediately
>>>> stopped and has not recurred since.
>>>>
>>>> Based on this evidence, our strong working theory is that the
>>>> unresponsiveness is directly related to SAML metadata fetching
>>>> failures during periods of external network instability, likely
>>>> causing a virtual thread deadlock.
>>>>
>>>> Thank you for your suggestions; we will keep you updated once we have
>>>> analyzed the jcmd and heap dump results.
>>>>
>>>> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>>>>
>>>>> If you can, jstack the process when it goes unresponsive. If there is
>>>>> a deadlock, it will tell you where it is.
>>>>>
>>>>> As the same user that it is running as:
>>>>>
>>>>> jstack <pid>
>>>>>
>>>>> If a deadlock is detected, it will say so at the end of the stack
>>>>> dump.
>>>>>
>>>>> On 10/20/25 10:11, Ocean Liu wrote:
>>>>>
>>>>> Hello Karol,
>>>>>
>>>>> Thank you for confirming that you are seeing this issue on v7.3.0 as
>>>>> well. Unfortunately, we also do not have steps to reproduce it yet.
>>>>>
>>>>> We had two more incidents just this morning, October 20th, around
>>>>> 7:00 AM and 8:00 AM PDT.
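[Editor's note: Richard's jstack advice can be wrapped in a small check. The deadlock banner is the one HotSpot actually prints at the end of a dump; the function name and usage are illustrative:]

```shell
# Check piped-in jstack output for a detected deadlock.
# HotSpot appends "Found one Java-level deadlock:" (or "Found N ...")
# when it detects one. Note this only covers platform threads; stuck
# virtual threads will NOT appear here, which is why the thread moved
# on to JSON dumps via:
#   jcmd <pid> Thread.dump_to_file -format=json <filename>
has_deadlock() {
    grep -q 'Found .* deadlock' -
}

# Live usage (not run here):
#   jstack <pid> | has_deadlock && echo "deadlock detected"
```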
>>>>> We have a current hypothesis that we are investigating: we are
>>>>> wondering if these CAS issues might be related to the widely reported
>>>>> AWS issues that occurred this morning, potentially impacting the
>>>>> availability of our service providers' SAML metadata.
>>>>>
>>>>> Have you noticed any correlation between your incidents and any
>>>>> external cloud service provider outages?
>>>>>
>>>>> Thanks again for sharing!
>>>>>
>>>>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> we have the same issue on 7.3.0. Unfortunately I don't know how to
>>>>>> reproduce it or what is causing it.
>>>>>>
>>>>>> On Tuesday, October 14, 2025 at 11:17:22 PM UTC+2 Ocean Liu wrote:
>>>>>>
>>>>>>> Hi Richard and Pascal,
>>>>>>>
>>>>>>> Thank you for the help! We will explore the external Tomcat option.
>>>>>>>
>>>>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>>>>>
>>>>>>>> > Has anyone encountered this specific behavior, particularly the
>>>>>>>> > need to block inbound traffic to achieve a successful restart?
>>>>>>>> > Any shared experiences or guidance would be greatly appreciated.
>>>>>>>>
>>>>>>>> On this subject, see the message "Deadlock on startup":
>>>>>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>>>>>
>>>>>>>> We switched from internal Tomcat to external Tomcat and this issue
>>>>>>>> is gone :-)
>>>>>>>>
>>>>>>>> cu
>>>>>>>
>>>>>>> --
>>>>>>> Ocean Liu | Enterprise Web Developer | Whitman College
>>>>>>> WCTS Building 105F - 509.527.4973
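[Editor's note: Derek's -Djava.net.preferIPv4Stack=true fix, mentioned near the top of the thread, can be applied without editing cas.service itself via a systemd drop-in. A sketch only: the thread does not show the actual unit file, so the JAVA_OPTS variable is an assumption; adjust to however your unit passes JVM options:]

```shell
# Write a systemd drop-in that sets the IPv4-preference JVM flag.
# The drop-in directory and file name follow systemd conventions;
# JAVA_OPTS is a hypothetical variable your ExecStart must consume.
write_ipv4_dropin() {
    dir="$1"
    mkdir -p "$dir"
    cat > "$dir/10-prefer-ipv4.conf" <<'EOF'
[Service]
Environment="JAVA_OPTS=-Djava.net.preferIPv4Stack=true"
EOF
}

# Live usage (not run here):
#   write_ipv4_dropin /etc/systemd/system/cas.service.d
#   systemctl daemon-reload && systemctl restart cas
```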
