Dear René,
On 14.08.2025 03:52, René Jansen via Oorexx-devel wrote:
These problems can be very timing specific - I have seen situations where an unexpected slowdown
or speedup of an application caused a world of trouble. Important lessons are: you have to trust
the implementation of your mutexes; for example OS/2 had a very unsafe one. The 'window of
trouble' needs to be as small as possible, and a slight speedup can cause your code to avoid it. I
think the suggestion not to jump into Rexx while holding a lock should be taken seriously. In my last
job before the current one I debugged some Java problems involving locks on memory objects, and you'd
never guess what the cause was: a programmer had read all the documentation and used IBM WebSphere's
very bulletproof multithreaded pooling mechanism. He did everything right and the design was great,
until I noticed that he had not cast his objects to the right classes, so the mechanism could never
find equality and the server locked up solid every time.
This was diagnosed with a Java profiling tool called GlowRoot - it is free and open source and it
also ran on z/OS J9 Java; its profiling of the memory locks was spot-on.
A good locking strategy often involves a protocol for locking the critical sections always in the
same order, so there is a chance of escaping instead of running into a deadlock. All contentious
situations are potential performance and continuity problems.
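A minimal Java sketch of such a protocol (the lock and method names are invented for illustration): every path that needs both resources acquires the locks in one agreed global order, so two threads can never end up holding them in opposite orders.

    import java.util.concurrent.locks.ReentrantLock;

    // Two shared resources guarded by separate locks. The protocol: whoever
    // needs both must always take LOCK_A before LOCK_B, never the other way round.
    public class LockOrdering {
        static final ReentrantLock LOCK_A = new ReentrantLock();
        static final ReentrantLock LOCK_B = new ReentrantLock();

        static void transferAtoB() {
            LOCK_A.lock();                 // global order: A first ...
            try {
                LOCK_B.lock();             // ... then B
                try {
                    // work on both resources
                } finally { LOCK_B.unlock(); }
            } finally { LOCK_A.unlock(); }
        }

        static void transferBtoA() {
            // Even though B is the "source" here, we still lock A first, keeping
            // the global order and ruling out the classic A/B deadlock cycle.
            LOCK_A.lock();
            try {
                LOCK_B.lock();
                try {
                    // work on both resources
                } finally { LOCK_B.unlock(); }
            } finally { LOCK_A.unlock(); }
        }
    }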
Now for 'it was working': if you have a reproducible case, you should be able to use git bisect on the
git copy of the ooRexx codebase to determine exactly when the possible error was introduced (although
you can see from the above that I do not assume it is that easy).
Anyway, git bisect lets you choose two points in history and then does a binary search: at each step
you build and mark the commit as good or bad, and it provably takes the minimal number of steps to find
the commit that introduced the problem. After that I would use the AI-suggested approach to find out
where it goes wrong.
Thank you for your thoughts and ideas!
Best regards
---rony
best regards,
René.
On 13 Aug 2025, at 20:01, Rony G. Flatscher <rony.flatsc...@wu.ac.at> wrote:
Yes, impressive, indeed, thank you, René.
However, there is one important piece of information missing: the application did work a couple of
years ago, and it still works sometimes, mostly on Linux and macOS, when it runs at all. Therefore I
think that in principle everything is set up correctly, but that a situation arises which causes that
hang. Having spent quite some time in that area of the interpreter, I was hoping to get some hints,
ideas, or theories about what a possible reason for it could be. Granted, this is an optimistic
request, but hey, if one does not try, one will not get a "lucky punch" hint. If there are no ideas,
then I need to go through the code systematically, which may take a lot of time and effort.
---rony
On 13.08.2025 16:08, Gilbert Barmwater via Oorexx-devel wrote:
WOW! Unbelievable that AI could do that, at least to me. If most of that is, in fact,
meaningful - and I have no way of knowing if it is or isn't, way over my head - this is a
significant addition to the ability to debug complex code problems. I have my fingers crossed
that this will help Rony find his problem because I want to believe in this approach. Thanks for
sharing, René!
Gil
On 8/13/2025 9:53 AM, René Jansen via Oorexx-devel wrote:
I asked my buddy AI for you:
Short version: almost everything here is *blocked, waiting on kernel objects/events*. One
thread (the one with |rexx.dll| in the stack) is trying to *attach to ooRexx* via BSF4ooRexx
while the JVM is already involved, and it’s waiting for the *ooRexx kernel mutex*. Meanwhile
several JVM worker threads are also parked in waits. This pattern screams *lock-order inversion
/ deadlock between Java ↔ ooRexx* (likely “call into Rexx while holding something, which calls
back into Java, which tries to attach back into Rexx and blocks on the Rexx global lock”).
What the stacks say
* Repeated tops of stack: |ntdll!NtWaitForSingleObject → KernelBase!WaitForSingleObjectEx → jvm.dll!...|
  That’s a *parked/waiting thread* (monitor/condition/OS event); not runnable.
* The interesting one (Not Flagged, tid |> 23728|):
  |win32u!NtUserMsgWaitForMultipleObjectsEx → user32!RealMsgWait… → rexx.dll!waitHandle → SysMutex::request → ActivityManager::lockKernel → Activity::waitForKernel → ActivityManager::addWaitingActivity → Activity::requestAccess → Activity::nestAttach → InterpreterInstance::attachThread → AttachThread → BSF4ooRexx850.dll …|
  This shows a *BSF/ooRexx attach* trying to acquire the *Rexx kernel lock* and *waiting* (message-wait variant, so it can pump messages).
* Many other JVM threads show the same wait pattern at different internal pcs (|jvm.dll!0x7117e75a|, |…e82f|, etc.). That’s consistent with *Java threads parked on monitors/conditions* (e.g., GC, JIT, RMI, pool workers) while some other thread is expected to make progress, but isn’t.
Likely scenario
1. A thread entered *ooRexx* and still *holds the Rexx kernel mutex* (ooRexx is single-kernel-locked).
2. During that work, it *called into Java* (BSF).
3. Another thread (or a callback on the same thread through message pumping) is now trying to *AttachThread / nestAttach* back into ooRexx → it blocks on |SysMutex::request()| because the kernel lock is still held.
4. Because of that, Java-side work that would release the original path can’t continue → *deadlock* (or at least a hard stall).
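The shape of that scenario can be reproduced in a few lines of plain Java; this is only an analogy (a ReentrantLock stands in for the ooRexx kernel mutex, an executor task for the Java/BSF side), but running it hangs in exactly the way described above:

    import java.util.concurrent.*;
    import java.util.concurrent.locks.ReentrantLock;

    // Analogy only: "kernel" plays the role of the single ooRexx kernel mutex,
    // the executor task plays the role of Java/BSF work triggered from Rexx.
    public class NestedAttachDeadlock {
        static final ReentrantLock kernel = new ReentrantLock();
        static final ExecutorService javaSide = Executors.newSingleThreadExecutor();

        public static void main(String[] args) throws Exception {
            kernel.lock();                        // step 1: "enter Rexx", hold the kernel lock
            try {
                Future<?> callback = javaSide.submit(() -> {
                    kernel.lock();                // step 3: the callback tries to "attach" again
                    try { /* would run Rexx code here */ } finally { kernel.unlock(); }
                });
                callback.get();                   // step 4: wait for the Java side -> hangs forever
            } finally {
                kernel.unlock();                  // never reached
            }
        }
    }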
How to confirm quickly
* *Grab a full dump* and in WinDbg run:
  o |!locks| (who owns the critical section / mutex in |rexx.dll|).
  o |~* kb| to see if any thread is inside |rexx.dll| /not/ waiting (the probable owner).
* From the Java side, run |jcmd <pid> Thread.print| (or |jstack|) and look for:
  o Threads in |parking to wait for <…>| with *owning thread* info.
  o Any thread doing a *native call into BSF/ooRexx*.
* In Process Explorer’s Threads tab, the *Waiting reason* for the Rexx-attaching thread will be a wait on an event/mutex handle; the *owner* (if resolvable) points to the holder.
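If running |jcmd| or |jstack| from outside is awkward (e.g., the hang only shows up in a deployed setup), the same Java-side picture can be captured from inside the process with the standard java.lang.management API; nothing BSF-specific is assumed here. One caveat: a thread blocked in a native wait inside |rexx.dll| usually shows up as RUNNABLE in a native method, so the WinDbg view is still needed for the ooRexx side.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Prints all Java thread stacks with held/waited-for locks and reports any
    // monitor/ownable-synchronizer deadlock the JVM itself can detect.
    public class ThreadDumper {
        public static void dump() {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
                System.out.print(ti);             // state, locks held, lock waited on
            }
            long[] deadlocked = mx.findDeadlockedThreads();
            if (deadlocked != null) {
                System.out.println("Deadlocked Java threads:");
                for (ThreadInfo ti : mx.getThreadInfo(deadlocked, true, true)) {
                    System.out.print(ti);
                }
            }
        }
    }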
Practical fixes / mitigations
* *Never call back into Rexx while holding the Rexx kernel lock.* In native ooRexx extensions this usually means:
  o Minimize the critical section; *release the kernel* before making Java/BSF calls that can reenter.
  o If you must call out, *queue* work (post to another thread) instead of re-entering synchronously.
* For *BSF4ooRexx* (see the sketch after this list):
  o Ensure every |AttachThread| is matched with |DetachThread| in a |try/finally|.
  o Avoid *nested attaches* (|nestAttach| shows on your stack). If you’re already attached, reuse the context; don’t attach again.
  o If Java code calls into Rexx, and Rexx code calls back into Java, enforce a single direction at a time (e.g., schedule the callback asynchronously).
*
Add diagnostics:
o
On the Rexx side, enable verbose/trace around *kernel lock* operations
if available;
log when you enter/leave Rexx from Java and vice versa.
o
On the Java side, log before/after *BSF calls* and include current
thread name +
whether already attached.
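To make the BSF4ooRexx points above concrete, here is a hedged sketch of the calling discipline; |RexxBridge| and its |isAttached|/|attach|/|detach|/|call| methods are hypothetical stand-ins, not the real BSF4ooRexx API. Only the discipline matters: reuse an existing attach instead of nesting, pair every attach with a detach in a |finally|, queue callbacks instead of re-entering synchronously, and log the thread name plus attach state on the way in and out.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical facade standing in for the real BSF4ooRexx entry points.
    interface RexxBridge {
        boolean isAttached();
        void attach();
        void detach();
        Object call(String routine, Object... args);
    }

    public class RexxCallDiscipline {
        private final RexxBridge rexx;
        // Dedicated thread for Rexx->Java->Rexx callbacks, so we never re-enter
        // the interpreter synchronously from a thread that may still hold its lock.
        private final ExecutorService callbackQueue = Executors.newSingleThreadExecutor();

        RexxCallDiscipline(RexxBridge rexx) { this.rexx = rexx; }

        Object callIntoRexx(String routine, Object... args) {
            boolean attachedHere = false;
            if (!rexx.isAttached()) {            // reuse an existing attach, never nest
                rexx.attach();
                attachedHere = true;
            }
            System.out.println("[BSF] enter " + routine + " on "
                    + Thread.currentThread().getName() + " attachedHere=" + attachedHere);
            try {
                return rexx.call(routine, args);
            } finally {
                System.out.println("[BSF] leave " + routine);
                if (attachedHere) {
                    rexx.detach();               // every attach paired with a detach
                }
            }
        }

        // A callback coming from the Rexx side is queued instead of calling
        // straight back in; this keeps the call direction one-way at a time.
        void scheduleCallback(String routine, Object... args) {
            callbackQueue.submit(() -> callIntoRexx(routine, args));
        }
    }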
If it’s not a full deadlock (just a stall)
It can still be *head-of-line blocking*: one long-running Rexx activity holds the kernel, and
many threads pile up on |requestAccess()|. The cure is the same—*shorten the locked region* or
make the long task cooperative (yield/release).
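In Java terms the same idea looks like the following sketch; |kernel| and |expensiveComputation()| are stand-ins rather than ooRexx API, and the only point is that the long-running work happens outside the lock while the lock guards just the short publish step.

    import java.util.concurrent.locks.ReentrantLock;

    public class ShortCriticalSection {
        static final ReentrantLock kernel = new ReentrantLock();
        static volatile Object shared;

        static Object expensiveComputation() {
            return new Object();                    // placeholder for long-running work
        }

        // Bad: everyone queueing for the lock waits for the whole computation.
        static void longTaskBad() {
            kernel.lock();
            try {
                shared = expensiveComputation();
            } finally { kernel.unlock(); }
        }

        // Good: compute without the lock, hold it only to publish the result.
        static void longTaskGood() {
            Object result = expensiveComputation();
            kernel.lock();
            try {
                shared = result;
            } finally { kernel.unlock(); }
        }
    }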
If you paste the owner of the Rexx mutex from |!locks| (or a |jstack| snippet showing the
thread doing the call into Rexx while others block), I can point at the exact offender and the
safest place to release the lock.
best regards,
René.
--
Gil Barmwater