Dear René,
On 14.08.2025 03:52, René Jansen via Oorexx-devel wrote:
These problems can be very timing specific - I have seen situations where an unexpected slowdown
or speedup of an application caused a world of trouble. Important lessons are: you have to trust
the implementation of your mutexes; for example OS/2 had a very unsafe one. The 'window of
trouble' needs to be as small as possible, and a slight speedup can cause your code to avoid it. I
think the suggestion not to jump into Rexx while holding a lock should be taken seriously. In my last
job before the current one I debugged some Java problems involving locks on memory objects, and you'd
never guess what the cause was: a programmer had read all the documentation and used IBM WebSphere's
very bulletproof multithreaded pooling mechanism. He did everything right and the design was great,
until I noticed that he had not cast his objects to the right classes, so the mechanism could never
find equality and the server locked up solid every time.
This was diagnosed with a Java profiling tool called GlowRoot - it is free and open source and it
also ran on z/OS J9 Java; its profiling of the memory locks was spot-on.
A good locking strategy often involves a protocol for locking the critical sections always in the
same order, so there is a chance of escaping instead of running into a deadlock. All contentious
situations are potential performance and continuity problems.
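A minimal Java sketch of such a protocol (the lock and method names are invented for illustration): every path that needs both resources acquires the locks in one agreed global order, so two threads can never end up holding them in opposite orders.

    import java.util.concurrent.locks.ReentrantLock;

    // Two shared resources guarded by separate locks. The protocol: whoever
    // needs both must always take LOCK_A before LOCK_B, never the other way round.
    public class LockOrdering {
        static final ReentrantLock LOCK_A = new ReentrantLock();
        static final ReentrantLock LOCK_B = new ReentrantLock();

        static void transferAtoB() {
            LOCK_A.lock();                 // global order: A first ...
            try {
                LOCK_B.lock();             // ... then B
                try {
                    // work on both resources
                } finally { LOCK_B.unlock(); }
            } finally { LOCK_A.unlock(); }
        }

        static void transferBtoA() {
            // Even though B is the "source" here, we still lock A first, keeping
            // the global order and ruling out the classic A/B deadlock cycle.
            LOCK_A.lock();
            try {
                LOCK_B.lock();
                try {
                    // work on both resources
                } finally { LOCK_B.unlock(); }
            } finally { LOCK_A.unlock(); }
        }
    }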
Now for 'it was working': if you have a reproducible case, you should be able to use git bisect on the
git copy of the ooRexx codebase to determine exactly when the possible error was introduced (although
you can see from the above that I do not assume it is that easy).
Anyway, git bisect lets you choose two points in history and then does a binary search: at each step
you build and mark the commit as good or bad, and it provably takes the minimal number of steps to find
the commit that introduced the problem. After that I would use the AI-suggested approach to find out
where it goes wrong.
Thank you for your thoughts and ideas!
Best regards
---rony
best regards,
René.
On 13 Aug 2025, at 20:01, Rony G. Flatscher <rony.flatsc...@wu.ac.at> wrote:
Yes, impressive, indeed, thank you, René.
However, there is one important piece of information missing: the application did work a couple of
years ago, and it still works sometimes, mostly on Linux and macOS, when it runs at all. Therefore I
think that in principle everything is set up correctly, but that a situation arises which causes that
hang. Having spent quite some time in that area of the interpreter, I was hoping to get some hints,
ideas, or theories about what a possible reason for it could be. Granted, this is an optimistic
request, but hey, if one does not try, one will not get a "lucky punch" hint. If there are no ideas,
then I need to go through the code systematically, which may take a lot of time and effort.
---rony
On 13.08.2025 16:08, Gilbert Barmwater via Oorexx-devel wrote:
WOW! Unbelievable that AI could do that, at least to me. If most of that is, in fact,
meaningful - and I have no way of knowing if it is or isn't, way over my head - this is a
significant addition to the ability to debug complex code problems. I have my fingers crossed
that this will help Rony find his problem because I want to believe in this approach. Thanks for
sharing, René!
Gil
On 8/13/2025 9:53 AM, René Jansen via Oorexx-devel wrote:
I asked my buddy AI for you:
Short version: almost everything here is *blocked, waiting on kernel objects/events*. One
thread (the one with |rexx.dll| in the stack) is trying to *attach to ooRexx* via BSF4ooRexx
while the JVM is already involved, and it’s waiting for the *ooRexx kernel mutex*. Meanwhile
several JVM worker threads are also parked in waits. This pattern screams *lock-order inversion
/ deadlock between Java ↔ ooRexx* (likely “call into Rexx while holding something, which calls
back into Java, which tries to attach back into Rexx and blocks on the Rexx global lock”).
What the stacks say
* Repeated tops of stack: |ntdll!NtWaitForSingleObject → KernelBase!WaitForSingleObjectEx → jvm.dll!...|
  That’s a *parked/waiting thread* (monitor/condition/OS event); not runnable.
* The interesting one (Not Flagged, tid |> 23728|):
  |win32u!NtUserMsgWaitForMultipleObjectsEx → user32!RealMsgWait… → rexx.dll!waitHandle → SysMutex::request → ActivityManager::lockKernel → Activity::waitForKernel → ActivityManager::addWaitingActivity → Activity::requestAccess → Activity::nestAttach → InterpreterInstance::attachThread → AttachThread → BSF4ooRexx850.dll …|
  This shows a *BSF/ooRexx attach* trying to acquire the *Rexx kernel lock* and *waiting* (message-wait variant, so it can pump messages).
* Many other JVM threads show the same wait pattern at different internal pcs (|jvm.dll!0x7117e75a|, |…e82f|, etc.). That’s consistent with *Java threads parked on monitors/conditions* (e.g., GC, JIT, RMI, pool workers) while some other thread is expected to make progress, but isn’t.
Likely scenario
1. A thread entered *ooRexx* and still *holds the Rexx kernel mutex* (ooRexx is single-kernel-locked).
2. During that work, it *called into Java* (BSF).
3. Another thread (or a callback on the same thread through message pumping) is now trying to *AttachThread / nestAttach* back into ooRexx → it blocks on |SysMutex::request()| because the kernel lock is still held.
4. Because of that, Java-side work that would release the original path can’t continue → *deadlock* (or at least a hard stall).
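The shape of that scenario can be reproduced in a few lines of plain Java; this is only an analogy (a ReentrantLock stands in for the ooRexx kernel mutex, an executor task for the Java/BSF side), but running it hangs in exactly the way described above:

    import java.util.concurrent.*;
    import java.util.concurrent.locks.ReentrantLock;

    // Analogy only: "kernel" plays the role of the single ooRexx kernel mutex,
    // the executor task plays the role of Java/BSF work triggered from Rexx.
    public class NestedAttachDeadlock {
        static final ReentrantLock kernel = new ReentrantLock();
        static final ExecutorService javaSide = Executors.newSingleThreadExecutor();

        public static void main(String[] args) throws Exception {
            kernel.lock();                        // step 1: "enter Rexx", hold the kernel lock
            try {
                Future<?> callback = javaSide.submit(() -> {
                    kernel.lock();                // step 3: the callback tries to "attach" again
                    try { /* would run Rexx code here */ } finally { kernel.unlock(); }
                });
                callback.get();                   // step 4: wait for the Java side -> hangs forever
            } finally {
                kernel.unlock();                  // never reached
            }
        }
    }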
How to confirm quickly
* *Grab a full dump* and in WinDbg run:
  o |!locks| (who owns the critical section / mutex in |rexx.dll|).
  o |~* kb| to see if any thread is inside |rexx.dll| /not/ waiting (the probable owner).
* From the Java side, run |jcmd <pid> Thread.print| (or |jstack|) and look for:
  o Threads in |parking to wait for <…>| with *owning thread* info.
  o Any thread doing a *native call into BSF/ooRexx*.
* In Process Explorer’s Threads tab, the *Waiting reason* for the Rexx-attaching thread will be a wait on an event/mutex handle; the *owner* (if resolvable) points to the holder.
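If running |jcmd| or |jstack| from outside is awkward (e.g., the hang only shows up in a deployed setup), the same Java-side picture can be captured from inside the process with the standard java.lang.management API; nothing BSF-specific is assumed here. One caveat: a thread blocked in a native wait inside |rexx.dll| usually shows up as RUNNABLE in a native method, so the WinDbg view is still needed for the ooRexx side.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Prints all Java thread stacks with held/waited-for locks and reports any
    // monitor/ownable-synchronizer deadlock the JVM itself can detect.
    public class ThreadDumper {
        public static void dump() {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
                System.out.print(ti);             // state, locks held, lock waited on
            }
            long[] deadlocked = mx.findDeadlockedThreads();
            if (deadlocked != null) {
                System.out.println("Deadlocked Java threads:");
                for (ThreadInfo ti : mx.getThreadInfo(deadlocked, true, true)) {
                    System.out.print(ti);
                }
            }
        }
    }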
Practical fixes / mitigations
* *Never call back into Rexx while holding the Rexx kernel lock.* In native ooRexx extensions this usually means:
  o Minimize the critical section; *release the kernel* before making Java/BSF calls that can reenter.
  o If you must call out, *queue* work (post to another thread) instead of re-entering synchronously.
* For *BSF4ooRexx* (see the sketch after this list):
  o Ensure every |AttachThread| is matched with |DetachThread| in a |try/finally|.
  o Avoid *nested attaches* (|nestAttach| shows on your stack). If you’re already attached, reuse the context; don’t attach again.
  o If Java code calls into Rexx, and Rexx code calls back into Java, enforce a single direction at a time (e.g., schedule the callback asynchronously).
*
Add diagnostics:
o
On the Rexx side, enable verbose/trace around *kernel lock* operations
if available;
log when you enter/leave Rexx from Java and vice versa.
o
On the Java side, log before/after *BSF calls* and include current
thread name +
whether already attached.
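To make the BSF4ooRexx points above concrete, here is a hedged sketch of the calling discipline; |RexxBridge| and its |isAttached|/|attach|/|detach|/|call| methods are hypothetical stand-ins, not the real BSF4ooRexx API. Only the discipline matters: reuse an existing attach instead of nesting, pair every attach with a detach in a |finally|, queue callbacks instead of re-entering synchronously, and log the thread name plus attach state on the way in and out.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical facade standing in for the real BSF4ooRexx entry points.
    interface RexxBridge {
        boolean isAttached();
        void attach();
        void detach();
        Object call(String routine, Object... args);
    }

    public class RexxCallDiscipline {
        private final RexxBridge rexx;
        // Dedicated thread for Rexx->Java->Rexx callbacks, so we never re-enter
        // the interpreter synchronously from a thread that may still hold its lock.
        private final ExecutorService callbackQueue = Executors.newSingleThreadExecutor();

        RexxCallDiscipline(RexxBridge rexx) { this.rexx = rexx; }

        Object callIntoRexx(String routine, Object... args) {
            boolean attachedHere = false;
            if (!rexx.isAttached()) {            // reuse an existing attach, never nest
                rexx.attach();
                attachedHere = true;
            }
            System.out.println("[BSF] enter " + routine + " on "
                    + Thread.currentThread().getName() + " attachedHere=" + attachedHere);
            try {
                return rexx.call(routine, args);
            } finally {
                System.out.println("[BSF] leave " + routine);
                if (attachedHere) {
                    rexx.detach();               // every attach paired with a detach
                }
            }
        }

        // A callback coming from the Rexx side is queued instead of calling
        // straight back in; this keeps the call direction one-way at a time.
        void scheduleCallback(String routine, Object... args) {
            callbackQueue.submit(() -> callIntoRexx(routine, args));
        }
    }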
If it’s not a full deadlock (just a stall)
It can still be *head-of-line blocking*: one long-running Rexx activity holds the kernel, and
many threads pile up on |requestAccess()|. The cure is the same—*shorten the locked region* or
make the long task cooperative (yield/release).
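In Java terms the same idea looks like the following sketch; |kernel| and |expensiveComputation()| are stand-ins rather than ooRexx API, and the only point is that the long-running work happens outside the lock while the lock guards just the short publish step.

    import java.util.concurrent.locks.ReentrantLock;

    public class ShortCriticalSection {
        static final ReentrantLock kernel = new ReentrantLock();
        static volatile Object shared;

        static Object expensiveComputation() {
            return new Object();                    // placeholder for long-running work
        }

        // Bad: everyone queueing for the lock waits for the whole computation.
        static void longTaskBad() {
            kernel.lock();
            try {
                shared = expensiveComputation();
            } finally { kernel.unlock(); }
        }

        // Good: compute without the lock, hold it only to publish the result.
        static void longTaskGood() {
            Object result = expensiveComputation();
            kernel.lock();
            try {
                shared = result;
            } finally { kernel.unlock(); }
        }
    }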
If you paste the owner of the Rexx mutex from |!locks| (or a |jstack| snippet showing the
thread doing the call into Rexx while others block), I can point at the exact offender and the
safest place to release the lock.
best regards,
René.
--
Gil Barmwater