Hi Richard,

Thank you for your response! We have made some progress on the diagnostics and have a strong new working theory.
We ran two initial `jstack` thread dumps and confirmed there are no signs of deadlocks among the standard platform threads. However, the system's behavior still strongly suggests a deadlock condition, leading us to suspect the newer virtual threads. We found this article from Netflix highly relevant to our suspicion:
https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d

Our next step is to use `jcmd` to capture thread dumps in JSON format (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can specifically inspect the status of the virtual threads. We will also capture a heap dump.

We discovered a key correlation that points to the root cause: during the AWS outage on Monday morning (10/20), our CAS service repeatedly became unresponsive every 15-30 minutes. We knew that Instructure (Canvas) was down. Once we switched the Instructure SAML metadata source from the external URL to a local backup copy, the unresponsiveness immediately stopped and has not recurred since.

Based on this evidence, our strong working theory is that the unresponsiveness is directly related to a SAML metadata fetching failure during periods of external network instability, likely causing a virtual thread deadlock.

Thank you for your suggestions, and we will keep you updated once we have analyzed the `jcmd` and heap dump results.

On Monday, October 20, 2025 at 9:32:49 AM UTC-7 Richard Frovarp wrote:

> If you can, jstack the process when it goes unresponsive. If there is a
> deadlock, it will tell you where it is.
>
> As the same user that it is running as:
>
> jstack <pid>
>
> If there is a deadlock detected, it will tell you so at the end of the
> stack.
>
> On 10/20/25 10:11, Ocean Liu wrote:
>
> Hello Karol,
>
> Thank you for confirming that you are seeing this issue on v7.3.0 as well.
> Unfortunately, we also do not have steps to reproduce it yet.
>
> We had two more incidents just this morning, October 20th, around 7:00 AM
> and 8:00 AM PDT.
> We have a current hypothesis that we are investigating: we are wondering
> if these CAS issues might be related to the widely reported AWS issues that
> occurred this morning, potentially impacting the availability of our
> service providers' SAML metadata.
>
> Have you noticed any correlation between your incidents and any external
> cloud service provider outages?
>
> Thanks again for sharing!
>
> On Monday, October 20, 2025 at 7:17:55 AM UTC-7 Karol Zajac wrote:
>
>> Hello,
>>
>> we have the same issue on 7.3.0. Unfortunately, I don't know how to
>> reproduce it or what is causing it.
>>
>> On Tuesday, October 14, 2025 at 23:17:22 UTC+2 Ocean Liu wrote:
>>
>>> Hi Richard and Pascal,
>>>
>>> Thank you for the help! We will explore the external Tomcat option.
>>>
>>> On Tuesday, October 14, 2025 at 9:53:46 AM UTC-7 Pascal Rigaux wrote:
>>>
>>>> On 14/10/2025 01:00, Ocean Liu wrote:
>>>>
>>>> > Has anyone encountered this specific behavior, particularly the need
>>>> > to block inbound traffic to achieve a successful restart? Any shared
>>>> > experiences or guidance would be greatly appreciated.
>>>>
>>>> On this subject, see msg "Deadlock on startup"
>>>> https://www.mail-archive.com/[email protected]/msg17421.html
>>>>
>>>> We switched from internal Tomcat to external Tomcat and this issue is
>>>> gone :-)
>>>>
>>>> cu

--
- Website: https://apereo.github.io/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/a/apereo.org/d/msgid/cas-user/0b5b315b-17ea-4d1e-b38f-0bd7231c10c7n%40apereo.org.
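As a footnote for readers following the virtual-thread theory in this thread: the failure mode suspected above (a stalled SAML metadata fetch blocking while holding a lock, as described in the Netflix article) can be sketched in a few lines. This is a hypothetical illustration, not CAS code: the class, method, and lock names are invented, and `Thread.sleep` merely stands in for a blocking network fetch performed inside a `synchronized` block. Requires JDK 21.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PinningSketch {
    // Hypothetical shared lock guarding a metadata cache.
    private static final Object METADATA_LOCK = new Object();

    // Stand-in for a SAML metadata refresh that blocks while holding a monitor.
    static void refreshMetadata() {
        synchronized (METADATA_LOCK) {
            try {
                // On JDK 21, blocking inside a synchronized block pins the
                // virtual thread to its carrier (platform) thread. If enough
                // virtual threads pin their carriers behind the same stalled
                // fetch, the carrier pool is exhausted and no other virtual
                // thread can be scheduled -- the service appears hung.
                Thread.sleep(50); // stand-in for a slow/blocked HTTP fetch
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        // One virtual thread per task, as CAS would for request handling.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 16; i++) {
                pool.submit(PinningSketch::refreshMetadata);
            }
        } // try-with-resources close() waits for all submitted tasks
        System.out.println("all refreshes finished");
    }
}
```

This sketch completes because the sleep is short; with a fetch that hangs for minutes (as during the AWS outage), the pinned carriers would not be released, which matches the observed unresponsiveness. Running the JVM with `-Djdk.tracePinnedThreads=full` logs a stack trace each time a virtual thread pins its carrier, which may help confirm or rule out this theory against the `jcmd` JSON dumps.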
