These problems can be very timing-specific - I have seen situations where an
unexpected slowdown or speedup of an application caused a world of trouble.
Important lessons: you have to be able to trust the implementation of your
mutexes (OS/2, for example, had a very unsafe one), and the 'window of
trouble' needs to be as small as possible - a slight speedup can make your
code avoid it, which hides the bug instead of fixing it. I think the
suggestion not to jump into Rexx while holding a lock should be taken
seriously. In my last job before the current one I debugged some Java
problems involving locks on memory objects, and you'd never guess what the
cause was: a programmer had read all the documentation and used IBM
WebSphere's very bulletproof multithreaded pooling mechanism. He did
everything right and the design was great, until I noticed that he did not
cast his objects to the right classes, so the mechanism could never find
equality and the server locked up solid every time. This was diagnosed with a
Java profiling tool called Glowroot - it is free and open source, and it also
ran on z/OS J9 Java; its profiling of the memory locks was spot-on.
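
Schematically, that bug looked like the following (class and key names are
invented for illustration, not the customer's code):

    import java.util.Objects;

    // The pool looks up pooled resources by key, comparing with equals().
    final class PoolKey {
        private final String name;
        PoolKey(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            // This type check fails when the caller passes the wrong type.
            if (!(o instanceof PoolKey)) return false;
            return name.equals(((PoolKey) o).name);
        }
        @Override public int hashCode() { return Objects.hash(name); }
    }

    public class EqualityPitfall {
        public static void main(String[] args) {
            PoolKey stored = new PoolKey("connection-42");
            Object lookedUp = "connection-42";   // a String, not a PoolKey
            // Never true, so the pool never finds its own entry:
            System.out.println(stored.equals(lookedUp));   // prints false
        }
    }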

A good locking strategy often involves a protocol of always locking the
critical sections in the same order; a fixed order rules out the circular
wait that turns contention into deadlock. All contended situations are
potential performance and continuity problems.
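
A minimal Java sketch of such a protocol (lock names invented): both threads
need both locks, but because each one takes lockA before lockB, neither can
end up waiting for the other in a cycle:

    import java.util.concurrent.locks.ReentrantLock;

    public class LockOrdering {
        static final ReentrantLock lockA = new ReentrantLock();
        static final ReentrantLock lockB = new ReentrantLock();

        // Every thread acquires in the fixed order A -> B, so no cycle forms.
        static void doWork(String who) {
            lockA.lock();
            try {
                lockB.lock();
                try {
                    System.out.println(who + " holds A and B");
                } finally {
                    lockB.unlock();
                }
            } finally {
                lockA.unlock();
            }
        }

        public static void main(String[] args) {
            new Thread(() -> doWork("t1")).start();
            new Thread(() -> doWork("t2")).start();
        }
    }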

Now for "it was working": if you have a reproducible case, you should be able
to use git bisect on the git copy of the ooRexx codebase to determine exactly
when the possible error was introduced (although, as the above shows, I do
not assume it is that easy).

Anyway, git bisect lets you mark two points in history - one known good, one
known bad - and then drives a binary search: it checks out a commit halfway
in between, you build and test, and you tell git whether that commit is good
or bad. It provably takes the minimal number of steps (logarithmic in the
number of commits in the range) to find the commit that introduced the
problem. Then I would use the AI-suggested approach to find out where it goes
wrong.
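
A typical session looks like this (revision and script names are
placeholders; run it in a clone of the ooRexx repository):

    git bisect start
    git bisect bad                   # the checked-out revision hangs
    git bisect good <known-good>     # a revision that still worked
    # git checks out the midpoint; build, run the reproducer, then mark it:
    git bisect good                  # ...or: git bisect bad
    # repeat until git reports "<sha> is the first bad commit", then:
    git bisect reset
    # with a scripted reproducer, "git bisect run ./reproduce.sh"
    # automates the whole loop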

best regards,

René.

> On 13 Aug 2025, at 20:01, Rony G. Flatscher <rony.flatsc...@wu.ac.at> wrote:
> 
> Yes, impressive, indeed, thank you, René.
> 
> However there is one important piece of information that is missing: that
> application did work a couple of years ago, and it sometimes still works,
> mostly on Linux and macOS, if it does. Therefore I think that in principle
> everything is set up correctly, but that a situation arises that causes the
> hang. Having spent quite some time with that area of the interpreter, I was
> hoping to get some hints, ideas, or theories as to what could be a possible
> reason for it. Granted, this is an optimistic request, but hey, if one does
> not try one will not get a "lucky punch" hint. If there are no ideas, then I
> need to go through the code systematically, which may take a lot of time
> and effort.
> 
> ---rony
> 
> 
> 
> On 13.08.2025 16:08, Gilbert Barmwater via Oorexx-devel wrote:
>> WOW! Unbelievable that AI could do that, at least to me. If most of that
>> is, in fact, meaningful - and I have no way of knowing whether it is or
>> isn't, it is way over my head - this is a significant addition to the
>> ability to debug complex code problems. I have my fingers crossed that
>> this will help Rony find his problem, because I want to believe in this
>> approach. Thanks for sharing, René!
>> 
>> Gil
>> 
>> On 8/13/2025 9:53 AM, René Jansen via Oorexx-devel wrote:
>>> I asked my buddy AI for you:
>>> 
>>> Short version: almost everything here is blocked, waiting on kernel 
>>> objects/events. One thread (the one with rexx.dll in the stack) is trying 
>>> to attach to ooRexx via BSF4ooRexx while the JVM is already involved, and 
>>> it’s waiting for the ooRexx kernel mutex. Meanwhile several JVM worker 
>>> threads are also parked in waits. This pattern screams lock-order inversion 
>>> / deadlock between Java ↔ ooRexx (likely “call into Rexx while holding 
>>> something, which calls back into Java, which tries to attach back into Rexx 
>>> and blocks on the Rexx global lock”).
>>> 
>>> What the stacks say
>>> 
>>> Repeated tops of stack:
>>> ntdll!NtWaitForSingleObject → KernelBase!WaitForSingleObjectEx → jvm.dll!...
>>> That’s a parked/waiting thread (monitor/condition/OS event); not runnable.
>>> 
>>> The interesting one (Not Flagged, tid > 23728):
>>> win32u!NtUserMsgWaitForMultipleObjectsEx → user32!RealMsgWait… → 
>>> rexx.dll!waitHandle → SysMutex::request → ActivityManager::lockKernel → 
>>> Activity::waitForKernel → ActivityManager::addWaitingActivity → 
>>> Activity::requestAccess → Activity::nestAttach → 
>>> InterpreterInstance::attachThread → AttachThread → BSF4ooRexx850.dll …
>>> This shows a BSF/ooRexx attach trying to acquire the Rexx kernel lock and 
>>> waiting (message-wait variant, so it can pump messages).
>>> 
>>> Many other JVM threads show the same wait pattern at different internal pcs 
>>> (jvm.dll!0x7117e75a, …e82f, etc.). That’s consistent with Java threads 
>>> parked on monitors/conditions (e.g., GC, JIT, RMI, pool workers) while some 
>>> other thread is expected to make progress—but isn’t.
>>> 
>>> Likely scenario
>>> 
>>> A thread entered ooRexx and still holds the Rexx kernel mutex (ooRexx is 
>>> single-kernel-locked).
>>> 
>>> During that work, it called into Java (BSF).
>>> 
>>> Another thread (or a callback on the same thread through message pumping) 
>>> is now trying to AttachThread / nestAttach back into ooRexx → it blocks on 
>>> SysMutex::request() because the kernel lock is still held.
>>> 
>>> Because of that, Java side work that would release the original path can’t 
>>> continue → deadlock (or at least a hard stall).
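>>> 
>>> A minimal Java analogue of that shape (all names invented; in the real
>>> case the lock is the Rexx kernel mutex inside rexx.dll):
>>> 
>>>     import java.util.concurrent.ExecutorService;
>>>     import java.util.concurrent.Executors;
>>>     import java.util.concurrent.Future;
>>> 
>>>     public class ReentryDeadlock {
>>>         // Stands in for the single Rexx kernel mutex.
>>>         static final Object kernelLock = new Object();
>>>         static final ExecutorService javaSide =
>>>                 Executors.newSingleThreadExecutor();
>>> 
>>>         public static void main(String[] args) throws Exception {
>>>             synchronized (kernelLock) {          // "Rexx work" holds the kernel
>>>                 Future<?> callback = javaSide.submit(() -> {
>>>                     synchronized (kernelLock) {  // "attach back into Rexx" blocks
>>>                         System.out.println("never reached");
>>>                     }
>>>                 });
>>>                 callback.get();  // waits for the callback while holding the lock:
>>>             }                    // neither thread can proceed - a hard deadlock
>>>         }
>>>     }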
>>> 
>>> How to confirm quickly
>>> 
>>> Grab a full dump and in WinDbg run:
>>> 
>>> !locks (who owns the critical section / mutex in rexx.dll).
>>> 
>>> ~* kb to see if any thread is inside rexx.dll not waiting (the probable 
>>> owner).
>>> 
>>> From the Java side, run jcmd <pid> Thread.print (or jstack) and look for:
>>> 
>>> Threads in parking to wait for <…> with owning thread info.
>>> 
>>> Any thread doing a native call into BSF/ooRexx.
>>> 
>>> In Process Explorer’s Threads tab, the Waiting reason for the 
>>> Rexx-attaching thread will be a wait on an event/mutex handle; the owner 
>>> (if resolvable) points to the holder.
>>> 
>>> Practical fixes / mitigations
>>> 
>>> Never call back into Rexx while holding the Rexx kernel lock. In native 
>>> ooRexx extensions this usually means:
>>> 
>>> Minimize the critical section; release the kernel before making Java/BSF 
>>> calls that can reenter.
>>> 
>>> If you must call out, queue work (post to another thread) instead of 
>>> re-entering synchronously.
>>> 
>>> For BSF4ooRexx:
>>> 
>>> Ensure every AttachThread is matched with a DetachThread in a try/finally
>>> (see the sketch after these points).
>>> 
>>> Avoid nested attaches (nestAttach shows on your stack). If you’re already 
>>> attached, reuse the context; don’t attach again.
>>> 
>>> If Java code calls into Rexx, and Rexx code calls back into Java, enforce a 
>>> single direction at a time (e.g., schedule the callback asynchronously).
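>>> 
>>> A sketch of the first two points (RexxBridge is a hypothetical wrapper,
>>> not the actual BSF4ooRexx API - the point is the attach/detach pairing
>>> and reusing an existing attachment instead of nesting):
>>> 
>>>     // Hypothetical helper; wire the real AttachThread/DetachThread calls
>>>     // into attach()/detach().
>>>     public final class RexxBridge {
>>>         private static final ThreadLocal<Boolean> attached =
>>>                 ThreadLocal.withInitial(() -> Boolean.FALSE);
>>> 
>>>         private static void attach() { attached.set(Boolean.TRUE); }
>>>         private static void detach() { attached.set(Boolean.FALSE); }
>>> 
>>>         public static void callIntoRexx(Runnable rexxWork) {
>>>             if (attached.get()) {      // already attached: reuse, never nest
>>>                 rexxWork.run();
>>>                 return;
>>>             }
>>>             attach();
>>>             try {
>>>                 rexxWork.run();
>>>             } finally {
>>>                 detach();              // guaranteed even if rexxWork throws
>>>             }
>>>         }
>>>     }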
>>> 
>>> Add diagnostics:
>>> 
>>> On the Rexx side, enable verbose/trace around kernel lock operations if 
>>> available; log when you enter/leave Rexx from Java and vice versa.
>>> 
>>> On the Java side, log before/after BSF calls and include current thread 
>>> name + whether already attached.
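>>> 
>>> For the Java-side logging, something as small as this works (logger name
>>> illustrative):
>>> 
>>>     import java.util.logging.Logger;
>>> 
>>>     final class BsfCallLog {
>>>         private static final Logger LOG = Logger.getLogger("bsf.bridge");
>>> 
>>>         // Call before and after each BSF/Rexx boundary crossing.
>>>         static void logBoundary(String direction, boolean alreadyAttached) {
>>>             LOG.info(() -> String.format("%s on thread %s (attached=%s)",
>>>                     direction, Thread.currentThread().getName(),
>>>                     alreadyAttached));
>>>         }
>>>     }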
>>> 
>>> If it’s not a full deadlock (just a stall)
>>> 
>>> It can still be head-of-line blocking: one long-running Rexx activity holds 
>>> the kernel, and many threads pile up on requestAccess(). The cure is the 
>>> same—shorten the locked region or make the long task cooperative 
>>> (yield/release).
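>>> 
>>> Sketch of that cure in generic Java (not ooRexx internals): do the slow
>>> work outside the lock and hold it only to publish the result:
>>> 
>>>     public class ShortCriticalSection {
>>>         private final Object lock = new Object();
>>>         private String published;
>>> 
>>>         private String slowComputation() {
>>>             return "result";             // stands in for long-running work
>>>         }
>>> 
>>>         void badPath() {                 // holds the lock during the slow part
>>>             synchronized (lock) {
>>>                 published = slowComputation();
>>>             }
>>>         }
>>> 
>>>         void goodPath() {                // slow part runs unlocked
>>>             String r = slowComputation();
>>>             synchronized (lock) {        // lock held only to publish
>>>                 published = r;
>>>             }
>>>         }
>>>     }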
>>> 
>>> If you paste the owner of the Rexx mutex from !locks (or a jstack snippet 
>>> showing the thread doing the call into Rexx while others block), I can 
>>> point at the exact offender and the safest place to release the lock.
>>> 
>>> best regards,
>>> 
>>> René.
>>> 
>> -- 
>> Gil Barmwater
>> 
> --
> __________________________________________________________________________________
> 
> Prof. Dr. Rony G. Flatscher, iR
> Department Wirtschaftsinformatik und Operations Management
> WU Wien
> Welthandelsplatz 1
> A-1020  Wien/Vienna, Austria/Europe
> 
> http://www.wu.ac.at
> __________________________________________________________________________________
> 

_______________________________________________
Oorexx-devel mailing list
Oorexx-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oorexx-devel
