Dear René,

On 14.08.2025 03:52, René Jansen via Oorexx-devel wrote:
These problems can be very timing-specific - I have seen situations where an unexpected slowdown or speedup of an application caused a world of trouble. Important lessons are: you have to trust the implementation of your mutexes; OS/2, for example, had a very unsafe one. The 'window of trouble' needs to be as small as possible, and a slight speedup can cause your code to avoid it. I think the suggestion not to jump into Rexx while holding a lock should be taken seriously. In my last job before the current one I debugged some Java problems involving locks on memory objects, and you'd never guess what the cause was: a programmer had read all the documentation and used IBM WebSphere's very bulletproof multithreaded pooling mechanism. He did everything right and the design was great, until I noticed that he did not cast his objects to the right classes, so the mechanism could never find equality and the server locked up solid every time. This was diagnosed with a Java profiling tool called GlowRoot - it is free and open source and it also ran on z/OS J9 Java; its profiling of the memory locks was spot-on.

A good locking strategy often involves a protocol for locking the critical sections always in the same order, so there is a chance of escaping instead of running into a deadlock. All contentious situations are potential performance and continuity problems.
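
To make the ordering idea concrete, here is a minimal, generic C++ sketch (not ooRexx code, the names are made up): as long as every thread that needs both locks takes them in the same order, the classic ABBA deadlock cannot occur, and std::scoped_lock can even take several mutexes at once with a built-in deadlock-avoidance algorithm.

    #include <mutex>

    std::mutex registryLock;   // hypothetical "outer" lock, always taken first
    std::mutex entryLock;      // hypothetical "inner" lock, always taken second

    void updateEntry()
    {
        // acquires both mutexes atomically and deadlock-free (C++17)
        std::scoped_lock guard(registryLock, entryLock);
        // ... critical section touching registry and entry ...
    }

    void readEntry()
    {
        // manual locking: same order as everywhere else, never the reverse
        std::lock_guard<std::mutex> outer(registryLock);
        std::lock_guard<std::mutex> inner(entryLock);
        // ... critical section ...
    }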

Now for the "it was working" part: if you have a reproducible case, you should be able to use git bisect on the git copy of the ooRexx codebase to determine exactly when the possible error was introduced (although, as you can see from the previous text, I do not assume it is that easy).

Anyway, git bisect lets you choose two points in history and walks a binary search between them: at each step you build, test, and mark the revision as good or bad. It provably takes the minimal number of steps to find the commit that introduced the problem. Then I would use the AI-suggested approach to find out where it goes wrong.
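
For readers who have not used it before, a typical bisect session looks roughly like this (the good revision is whatever last known-working state you can still build):

    git bisect start
    git bisect bad                      # the current HEAD shows the hang
    git bisect good <last-known-good>   # e.g. a revision from "a couple of years ago"
    # git now checks out a revision halfway in between; build it, run the
    # reproducer, then mark the result and repeat until git names the culprit:
    git bisect good     # or: git bisect bad
    git bisect reset                    # return to the original checkout when done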

thank you for your thoughts and ideas!

Best regards

---rony



best regards,

René.

On 13 Aug 2025, at 20:01, Rony G. Flatscher <rony.flatsc...@wu.ac.at> wrote:

Yes, impressive, indeed, thank you, René.

However, there is one important piece of information that is missing: that application did work a couple of years ago, and it sometimes still works, mostly on Linux and macOS, when it does work at all. Therefore I think that in principle everything is set up correctly, but that a situation arises that causes that hang. Having spent quite some time with that area of the interpreter, I was hoping to get some hints, ideas, or theories about what could be a possible reason for it. Granted, this is an optimistic request, but hey, if one does not try, one will not get a "lucky punch" hint. If there are no ideas, then I need to systematically go through the code, which may take a lot of time and effort.

---rony


On 13.08.2025 16:08, Gilbert Barmwater via Oorexx-devel wrote:

WOW! Unbelievable that AI could do that, at least to me. If most of that is, in fact, meaningful - and I have no way of knowing whether it is or isn't, it is way over my head - this is a significant addition to the ability to debug complex code problems. I have my fingers crossed that this will help Rony find his problem, because I want to believe in this approach. Thanks for sharing, René!

Gil

On 8/13/2025 9:53 AM, René Jansen via Oorexx-devel wrote:
I asked my buddy AI for you:

Short version: almost everything here is *blocked, waiting on kernel objects/events*. One thread (the one with |rexx.dll| in the stack) is trying to *attach to ooRexx* via BSF4ooRexx while the JVM is already involved, and it’s waiting for the *ooRexx kernel mutex*. Meanwhile several JVM worker threads are also parked in waits. This pattern screams *lock-order inversion / deadlock between Java ↔ ooRexx* (likely “call into Rexx while holding something, which calls back into Java, which tries to attach back into Rexx and blocks on the Rexx global lock”).


      What the stacks say

 *  Repeated tops of stack:
    |ntdll!NtWaitForSingleObject → KernelBase!WaitForSingleObjectEx → jvm.dll!...|
    That's a *parked/waiting thread* (monitor/condition/OS event); not runnable.

 *  The interesting one (Not Flagged, tid |> 23728|):
    |win32u!NtUserMsgWaitForMultipleObjectsEx → user32!RealMsgWait… → rexx.dll!waitHandle →
    SysMutex::request → ActivityManager::lockKernel → Activity::waitForKernel →
    ActivityManager::addWaitingActivity → Activity::requestAccess → Activity::nestAttach →
    InterpreterInstance::attachThread → AttachThread → BSF4ooRexx850.dll …|
    This shows a *BSF/ooRexx attach* trying to acquire the *Rexx kernel lock* and *waiting*
    (message-wait variant, so it can pump messages).

 *  Many other JVM threads show the same wait pattern at different internal pcs
    (|jvm.dll!0x7117e75a|, |…e82f|, etc.). That's consistent with *Java threads parked on
    monitors/conditions* (e.g., GC, JIT, RMI, pool workers) while some other thread is
    expected to make progress, but isn't.

      Likely scenario

 1. A thread entered *ooRexx* and still *holds the Rexx kernel mutex* (ooRexx is
    single-kernel-locked).

 2. During that work, it *called into Java* (BSF).

 3. Another thread (or a callback on the same thread through message pumping) is now trying
    to *AttachThread / nestAttach* back into ooRexx → it blocks on |SysMutex::request()|
    because the kernel lock is still held.

 4. Because of that, the Java-side work that would release the original path can't
    continue → *deadlock* (or at least a hard stall).
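
A minimal, self-contained C++ sketch of that shape (purely illustrative, all names hypothetical, not the actual BSF4ooRexx or interpreter code) deadlocks in exactly the same way when run:

    #include <future>
    #include <mutex>

    std::mutex rexxKernelLock;                // stand-in for the single ooRexx kernel lock

    void javaSideCallback()                   // "thread T2": Java work triggered by the call-out
    {
        std::lock_guard<std::mutex> attach(rexxKernelLock);   // AttachThread/nestAttach blocks here
        // ... would call back into Rexx ...
    }

    void rexxNativeRoutine()                  // "thread T1": native code invoked from Rexx
    {
        std::lock_guard<std::mutex> kernel(rexxKernelLock);   // T1 owns the kernel lock

        // call out to "Java" and wait for the reply; the callback needs the same lock
        auto reply = std::async(std::launch::async, javaSideCallback);
        reply.get();                          // T1 waits for T2, T2 waits for T1 -> hangs forever
    }

    int main() { rexxNativeRoutine(); }       // deliberately deadlocks when run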


      How to confirm quickly

 *  *Grab a full dump* and in WinDbg run:

     o  |!locks| (who owns the critical section / mutex in |rexx.dll|).

     o  |~* kb| to see if any thread is inside |rexx.dll| /not/ waiting (the probable owner).

 *  From the Java side, run |jcmd <pid> Thread.print| (or |jstack|) and look for:

     o  Threads in |parking to wait for <…>| with *owning thread* info.

     o  Any thread doing a *native call into BSF/ooRexx*.

 *  In Process Explorer's Threads tab, the *Waiting reason* for the Rexx-attaching thread
    will be a wait on an event/mutex handle; the *owner* (if resolvable) points to the holder.


      Practical fixes / mitigations

 *  *Never call back into Rexx while holding the Rexx kernel lock.* In native ooRexx
    extensions this usually means:

     o  Minimize the critical section; *release the kernel* before making Java/BSF calls
        that can reenter.

     o  If you must call out, *queue* work (post to another thread) instead of re-entering
        synchronously.

 *  For *BSF4ooRexx* (see the sketch after this list):

     o  Ensure every |AttachThread| is matched with |DetachThread| in a |try/finally|.

     o  Avoid *nested attaches* (|nestAttach| shows on your stack). If you're already
        attached, reuse the context; don't attach again.

     o  If Java code calls into Rexx, and Rexx code calls back into Java, enforce a single
        direction at a time (e.g., schedule the callback asynchronously).

 *  Add diagnostics:

     o  On the Rexx side, enable verbose/trace around *kernel lock* operations if available;
        log when you enter/leave Rexx from Java and vice versa.

     o  On the Java side, log before/after *BSF calls* and include the current thread name
        plus whether it is already attached.
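
To make the attach/detach pairing concrete: in native code the usual trick is a small RAII guard around the ooRexx C++ API calls (RexxInstance::AttachThread / RexxThreadContext::DetachThread, if I remember the native API correctly), so the detach happens on every exit path. This is only a sketch; the logging and the surrounding names are illustrative.

    #include <oorexxapi.h>
    #include <cstdio>

    // Sketch of an attach guard for native code calling into ooRexx.
    // Assumes "instance" was created elsewhere (e.g. via RexxCreateInterpreter);
    // everything except AttachThread/DetachThread is made up for illustration.
    class RexxAttachGuard
    {
    public:
        explicit RexxAttachGuard(RexxInstance *instance) : context(NULL)
        {
            instance->AttachThread(&context);        // attach exactly once per call-in
            fprintf(stderr, "[BSF] attached, context=%p\n", (void *)context);
        }

        ~RexxAttachGuard()
        {
            if (context != NULL)
            {
                fprintf(stderr, "[BSF] detaching, context=%p\n", (void *)context);
                context->DetachThread();             // always balanced with the attach above
            }
        }

        RexxThreadContext *get() const { return context; }

    private:
        RexxThreadContext *context;                  // per-thread context (initialized to NULL)
    };

    // usage: the guard lives for the duration of one call-in; no nested attach needed
    void callIntoRexx(RexxInstance *instance)
    {
        RexxAttachGuard guard(instance);
        // ... use guard.get() to call routines / send messages in ooRexx ...
    }   // DetachThread() runs here, even if the code above returns early or throws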


      If it’s not a full deadlock (just a stall)

It can still be *head-of-line blocking*: one long-running Rexx activity holds the kernel, and many threads pile up on |requestAccess()|. The cure is the same—*shorten the locked region* or make the long task cooperative (yield/release).
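
As a generic illustration of "shorten the locked region" (again just a sketch with made-up names, not ooRexx internals): copy whatever you need while holding the lock, release it, and only then make the long or re-entrant call-out.

    #include <deque>
    #include <mutex>
    #include <string>

    std::mutex              kernelLock;    // stand-in for the contended lock
    std::deque<std::string> pendingWork;   // the state it protects

    void callIntoJava(const std::string &item)
    {
        // the long, possibly re-entrant call-out (BSF/JNI in the real system)
    }

    void handleOneRequest()
    {
        std::string item;
        {
            std::lock_guard<std::mutex> guard(kernelLock);   // hold the lock only to grab the data
            if (pendingWork.empty())
            {
                return;                                      // the guard releases the lock here too
            }
            item = pendingWork.front();
            pendingWork.pop_front();
        }                                                    // lock released before the call-out

        callIntoJava(item);    // anything re-entering from here no longer piles up on the lock
    }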

If you paste the owner of the Rexx mutex from |!locks| (or a |jstack| snippet showing the thread doing the call into Rexx while others block), I can point at the exact offender and the safest place to release the lock.

best regards,

René.


--
Gil Barmwater