My issues were definitely related to the virtual threads. Intermittently (frequently), my CAS would fail to start on reboot/restart of the service. Similarly, there were no "deadlocks" reported for me, just threads waiting forever. Like Richard, I found that blocking traffic during startup helped.
Disabling these has completely fixed my issues (knock on wood; I've had about 10 restarts now with no hangs, where before this the chance of a hang was 50% or greater), although I suspect the eager-init setting is unneeded:

spring.cloud.refresh.scope.eager-init=false
spring.threads.virtual.enabled=false

On Thursday, October 23, 2025 at 2:12:35 PM UTC-4 Ocean Liu wrote:

> Hi Richard,
>
> Thank you for your response! We have made some progress on the diagnostics and have a strong new working theory.
>
> We ran two initial `jstack` thread dumps and confirmed there are no signs of deadlocks among the standard platform threads. However, the system's behavior still strongly suggests a deadlock condition, leading us to suspect the newer virtual threads. We found this article from Netflix highly relevant to our suspicion:
> https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
>
> Our next step is to use `jcmd` to capture thread dumps in JSON format (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can specifically inspect the status of the virtual threads. We will also capture a heap dump.
>
> We discovered a key correlation that points to the root cause: during the AWS outage on Monday morning (10/20), our CAS service repeatedly became unresponsive every 15-30 minutes. We knew that Instructure (Canvas) was down. Once we switched the Instructure SAML metadata source from the external URL to a local backup copy, the unresponsiveness immediately stopped and has not recurred since.
>
> Based on this evidence, our strong working theory is that the unresponsiveness is directly related to a SAML metadata fetching failure during periods of external network instability, likely causing a virtual thread deadlock.
>
> Thank you for your suggestions, and we will keep you updated once we have analyzed the jcmd and heap dump results.
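For anyone wanting to try the same workaround: both settings above are standard Spring Boot / Spring Cloud properties, so in a typical CAS WAR-overlay deployment they would go into the CAS properties file (commonly `etc/cas/config/cas.properties`; the exact location depends on your overlay). A minimal sketch of that fragment, with my own comments added:

```properties
# Run request handling on classic platform threads instead of
# JDK 21 virtual threads (the suspected source of the hangs).
spring.threads.virtual.enabled=false

# Do not eagerly initialize refresh-scoped beans at startup
# (per the report above, this one may be unnecessary).
spring.cloud.refresh.scope.eager-init=false
```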
> On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:
>
>> If you can, jstack the process when it goes unresponsive. If there is a deadlock, it will tell you where it is.
>>
>> Run it as the same user that the process is running as:
>>
>> jstack <pid>
>>
>> If a deadlock is detected, it will tell you so at the end of the stack.
>>
>> On 10/20/25 10:11, Ocean Liu wrote:
>>
>> Hello Karol,
>>
>> Thank you for confirming that you are seeing this issue on v7.3.0 as well. Unfortunately, we also do not have steps to reproduce it yet.
>>
>> We had two more incidents just this morning, October 20th, around 7:00 AM and 8:00 AM PDT. We have a current hypothesis that we are investigating: we are wondering if these CAS issues might be related to the widely reported AWS issues that occurred this morning, potentially impacting the availability of our service providers' SAML metadata.
>>
>> Have you noticed any correlation between your incidents and any external cloud service provider outages?
>>
>> Thanks again for sharing!
>>
>> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>>
>>> Hello,
>>>
>>> We have the same issue on 7.3.0. Unfortunately, I don't know how to reproduce it or what is causing it.
>>>
>>> On Tuesday, October 14, 2025 at 23:17:22 UTC+2 Ocean Liu wrote:
>>>
>>>> Hi Richard and Pascal,
>>>>
>>>> Thank you for the help! We will explore the external Tomcat option.
>>>>
>>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux wrote:
>>>>
>>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>>
>>>>> > Has anyone encountered this specific behavior, particularly the need to block inbound traffic to achieve a successful restart? Any shared experiences or guidance would be greatly appreciated.
>>>>>
>>>>> On this subject, see the message "Deadlock on startup":
>>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>>
>>>>> We switched from internal Tomcat to external Tomcat and this issue is gone :-)
>>>>>
>>>>> cu

--
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/a/apereo.org/d/msgid/cas-user/ddfbe9f7-e261-49ec-a326-75dcdc8d6220n%40apereo.org.
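A note on why the jstack dumps in this thread came back clean even though the service was stuck: the JVM's built-in deadlock detection (the summary jstack prints at the end of a dump, and the `ThreadMXBean.findDeadlockedThreads()` API behind it) only examines platform threads, so a lock held by a blocked or pinned virtual thread will not be reported; that is why `jcmd <pid> Thread.dump_to_file -format=json` is the right tool for virtual threads. As a hedged illustration (class name `DeadlockCheck` is mine, not from the thread), the same check jstack performs can be run from inside the JVM with the standard management API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        // Same detection jstack reports at the end of a dump:
        // cycles of platform threads blocked on monitors/ownable synchronizers.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // returns null when no deadlock
        if (ids == null) {
            System.out.println("No platform-thread deadlock detected");
        } else {
            // Print the stack of each thread involved in the cycle.
            for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
                System.out.println(info);
            }
        }
        // Caveat: like jstack, this sees only platform threads, so a hang
        // caused by virtual threads can still show "no deadlock" here.
    }
}
```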
