This patch makes the crashes go away with the test program:

   Index: interpreter/concurrency/ActivityManager.cpp
   ===================================================================
   --- interpreter/concurrency/ActivityManager.cpp      (revision 13003)
   +++ interpreter/concurrency/ActivityManager.cpp      (working copy)
   @@ -568,6 +568,8 @@
      */
     void ActivityManager::clearActivityPool()
     {
   +    ResourceSection lock;                // critical section, protect 
interaction with activity queues
   +
         Activity *activity = (Activity *)availableActivities->pull();
         while (activity != OREF_NULL)
         {
   @@ -591,6 +593,8 @@
      */
     bool ActivityManager::poolActivity(Activity *activity)
     {
   +    // ResourceSection lock held by caller: bool 
InterpreterInstance::poolActivity(Activity *activity
   +
         // are we shutting down or have too many threads in the pool?
         if (processTerminating || availableActivities->items() > 
MAX_THREAD_POOL_SIZE)
         {

clearActivityPool() was not sheltered as a critical section by a ResourceSection lock. The test program now runs without crashes and a magnitude or two faster than before, which is a very pleasant side effect.

As running the test suite works with no crashes, I will apply the patch to trunk such that Jenkins can produce and test Rexx interpreters on all of the Jenkins platforms.

---

A request in this context: debugging this has been an enormous, almost irrational effort, also, because there is important conceptual information about the interpreter missing for those who have not created it, hence supplying this information is important to make sure that future generations can learn, understand the architecture quickly to allow newcomers to maintain and/or extend the interpreter for good! E.g. the ResourceSection serializes access to the various activity queues which is important, however from the name this is not clear and there is no comment that explains this, at least I did not find any. Or, what do the different locks shelter, serialize for what reason. Or, which locks can be used in which sequence (this is buried/scattered in comments in the source code, yet, getting quickly a conceptual overview of the interpreter would be possible, if there was a conceptual overview of the architecture and most important parts of the interprter, including a conceptual overview about flattening, rxapi and the like). This would benefit anyone in this community who has a desire to understand the interpreter in an easy, quick manner!

---rony


On 06.08.2025 16:56, Rony G. Flatscher wrote:

Continuing to experiment with Sleep(23) to see where it would excercise a positive effect if not crashing immediately at terminating the second instance on another thread:

  * placed "Sleep()" in "Activity *ActivityManager.cpp::getRootActivity()"

      o doing a "Sleep(23);" before the statement "Activity *oldActivity = 
findActivity();":
          + crashes at terminating the second instance a couple of times, but 
eventually it ran to
            the end, creating 5,000 instances and terminating each on a 
separate thread

      o moving the "Sleep(23);" before the statement "Activity *activityObject =
        createCurrentActivity();":
          + same as above, a crash terminating the second instance on a 
separate thread,
            re-running the program in the same session may yield success in 
creating and
            terminating 5,000 instances

  * placed "Sleep()" in "Activity *ActivityManager::createCurrentActivity()"

      o doing a "Sleep(23);" before the statement "Activity *activity = new 
Activity(p, false);":
          + same as above, crashes at terminating the second instance a couple 
of times, but
            eventually it ran to the end, creating 5,000 instances and 
terminating each on a
            separate thread

      o moving the "Sleep(23);" before the statement 
"allActivities->append(activity);":
          + crashed once terminating the second instance on a separate thread, 
re-running the
            compiled program in the same session yields success in creating and 
terminating all
            5,000 instances, running it on new sessions succeeds!

So the Sleep(23) has an  effect while appending the new activity to the allActivities queue in createCurrentActivity(), as if concurrent access to the allActivities queue (a QueueClass) may somehow cause a problem (the test program creates 5,000 interpreter instances and uses them on the main thread to output a dot to stdout and immediately after using it, the instance gets terminated on a separate thread).

Since placing the Sleep(23) right in front of allActivities->append(activity) in createCurrentActivity(), the test program seems to succeed when run in a separate, new session repeatedly.

Any ideas, suggestions?

---rony

P.S.: Enclosed the test program as used for these tests: "nmake Makefile2.windows" and "testTerminate2.exe 1".


6:12


On 05.08.2025 12:51, Rony G. Flatscher wrote:

It seems that the crash can be delayed by a couple of thousand Rexx interpreter instances if adding (Windows) "Sleep(20);" right before "rootActivity = ActivityManager::getRootActivity();" in Interpreter.cpp's "InterpreterInstance *Interpreter::createInterpreterInstance(RexxOption *options)".

Although this is a huge improvement to crashing after three interpreter instance creations after having terminated one interpreter instance on another thread, it does not solve the problem. So a real fix is needed (the problem may occur on any host application that wishes to employ ooRexx scripts as macros, potentially creating and terminating interpreter instances on different threads, and a crashing ooRexx would crash the entire process and hence the host application, which is not acceptable for companies to employ ooRexx as a scripting language for their applications, which would be really a boon for many).

Ultimately, it seems, the crash is linked to invoking "Activity *ActivityManager::createCurrentActivity()" in ActivityManager.cpp. createCurrentActivity() gets invoked via createInterpreterInstance()'s invocation of getRootActivity().  The root cause for the crash seems to be that the interpreter instance gets created sometimes on the wrong Actvity/thread. Not sure why doing a Sleep(20) before invoking getRootActivity() has such a dramatic positive effect.

[The other path to invoking createCurrentActivity() is via "Activity *ActivityManager::attachThread()". Note, the test program does not cause attachThread() to be invoked as it immediately uses the created interpreter instance to create and run a routine and then terminates it on a separate thread.]

Not sure how to proceed. Any ideas, suggestions?

---rony


On 30.07.2025 16:50, Rony G. Flatscher wrote:

Using this e-mail thread to report new insights.

Prolog: first of all, it is a little bit bewitched: sometimes it works, sometimes it doesn't. Then, after many, many hours of all sort of debuggings I arrived at a situation where I could reproduce the crash. And then, after further inspections and debug output edits, the crashes would disappear and it took me again hours to become able to get to a version that would crash in a certain constellation. Because of this I do not dare to edit (make it better readable) the source of the test program as that caused the vanishment of the crash.

So enclosed you will find the patch "getInterpreterInfosWithIds.diff" that produces debug information for interpreter instance creations and terminations. The enclosed program "testTerminate2.cpp" can be compiled with "nmake Makefile2.windows" and produces "testTerminate2.exe".

What is the purpose of testTerminate2.cpp?

  * It creates an interpreter instance and uses it to display the ooRexx 
version and then
    terminates it, followed by a loop which gets carried out five times in 
which a Rexx
    interpreter instance gets created, and a little Rexx program
    ("tid=.context~thread;.error~charOut('<{tid='tid'}>');") gets executed on 
the main thread.

  * Depending on the command line arguments the termination of each Rexx 
interpreter instance in
    the loop occurs:
      o (case 1) on the main thread, if no arguments are supplied: this works 
flawlessly

      o on a separate thread, if one or two arguments are supplied: this may 
crash or succeed, if

          + (case 2) one argument gets supplied: this crashes in "bool
            InterpreterInstance::terminate()" after invoking the statement:
            "memoryObject.collectAndUninit(Interpreter::lastInstance());"

          + (case 3) two arguments get supplied: this works MOST of the time, 
and if it crashes
            it does so as in (case 2) above

With the enclosed version I have not yet managed to crash (case 3) which seems to be just a matter of time/luck.

However with (case 2) I get the reported crash. The difference between (case 2) and (case 3), which both terminate the Rexx interpreter instance on a separate thread, is "harmless", as the only difference in (case 3) is that it carries out a "fflush(stderr)" before terminating the instance (not sure why that has such a significant effect in this case).

Here the output of running the three cases:

  * (case 1) output:
    G:\tmp\orx\bugs\crashOnTerminate20250729>rxenv testTerminate2.exe
    *** main() - entered testTerminate2.main() ***

    
<IntInst::inititalize#175/this=I1/thread=T1><IntInst::terminate#529\this=I1\thread=T1>

    <IntInst::inititalize#175/this=I2/thread=T1>Created interpreter instance 
version=5.2.0 language level=6.05
    ... terminating this interpreter instance 
...<IntInst::terminate#529\this=I2\thread=T1>

    
<IntInst::inititalize#175/this=I3/thread=T1><IntInst::terminate#529\this=I3\thread=T1>
      done.

    There is [1] argument, therefore:
             testTerminate2.exe: creating total of [5] interpreters, running 
[tid=.context~thread;.error~charOut('<{tid='tid'}>');]
                     terminating on SAME thread


    
<IntInst::inititalize#175/this=I4/thread=T1><{tid=1}><IntInst::terminate#529\this=I4\thread=T1>

    
<IntInst::inititalize#175/*this=I5/thread=T1*><IntInst::terminate#529\this=I5\thread=T1>

    
<IntInst::inititalize#175/this=I6/thread=T1><{tid=1}><IntInst::terminate#529\this=I6\thread=T1>

    
<IntInst::inititalize#175/*this=I7/thread=T1*><IntInst::terminate#529\this=I7\thread=T1>

    
<IntInst::inititalize#175/this=I8/thread=T1><{tid=1}><IntInst::terminate#529\this=I8\thread=T1>

    
<IntInst::inititalize#175/*this=I9/thread=T1*><IntInst::terminate#529\this=I9\thread=T1>

    
<IntInst::inititalize#175/this=I10/thread=T1><{tid=1}><IntInst::terminate#529\this=I10\thread=T1>

    
<IntInst::inititalize#175/*this=I11/thread=T1*><IntInst::terminate#529\this=I11\thread=T1>

    
<IntInst::inititalize#175/this=I12/thread=T1><{tid=1}><IntInst::terminate#529\this=I12\thread=T1>

    
<IntInst::inititalize#175/*this=I13/thread=T1*><IntInst::terminate#529\this=I13\thread=T1>

    ---
    *** main() - done: created [5] instances, ran 
[tid=.context~thread;.error~charOut('<{tid='tid'}>');], each terminated on the 
SAME thread
    ---
    *** main() - exiting testTerminate2.main() after Sleep(500); ***


  * (case 2) output:
    G:\tmp\orx\bugs\crashOnTerminate20250729>rxenv testTerminate2.exe 1
    *** main() - entered testTerminate2.main() ***

    
<IntInst::inititalize#175/this=I1/thread=T1><IntInst::terminate#529\this=I1\thread=T1>

    <IntInst::inititalize#175/this=I2/thread=T1>Created interpreter instance 
version=5.2.0 language level=6.05
    ... terminating this interpreter instance 
...<IntInst::terminate#529\this=I2\thread=T1>

    
<IntInst::inititalize#175/this=I3/thread=T1><IntInst::terminate#529\this=I3\thread=T1>
      done.

    There are [2] arguments, therefore:
             testTerminate2.exe: creating total of [5] interpreters, running 
[tid=.context~thread;.error~charOut('<{tid='tid'}>');]
                     terminating on SEPARATE thread


    <IntInst::inititalize#175/this=I4/thread=*T1*><{tid=1}>
    Loop # [1], created thread [0000000000000210] for terminating instance ...
    <IntInst::terminate#529\this=I4\thread=*T2*>

    
<IntInst::inititalize#175/*this=I5/thread=T2*><IntInst::terminate#529\this=I5\thread=*T2*>

    <IntInst::inititalize#175/this=I6/thread=T1><{tid=1}>
    Loop # [2], created thread [0000000000000214] for terminating instance ...
    <IntInst::terminate#529\this=I6\thread=T3>

    G:\tmp\orx\bugs\crashOnTerminate20250729>/** CRASH ** in 
memoryObject.collectAndUninit(Interpreter::lastInstance()); /
  * (case 3) output:
    G:\tmp\orx\bugs\crashOnTerminate20250729>rxenv testTerminate2.exe 1 1
    *** main() - entered testTerminate2.main() ***

    
<IntInst::inititalize#175/this=I1/thread=T1><IntInst::terminate#529\this=I1\thread=T1>

    <IntInst::inititalize#175/this=I2/thread=T1>Created interpreter instance 
version=5.2.0 language level=6.05
    ... terminating this interpreter instance 
...<IntInst::terminate#529\this=I2\thread=T1>

    
<IntInst::inititalize#175/this=I3/thread=T1><IntInst::terminate#529\this=I3\thread=T1>
      done.

    There are [3] arguments, therefore:
             testTerminate2.exe: creating total of [5] interpreters, running 
[tid=.context~thread;.error~charOut('<{tid='tid'}>');]
                     terminating on SEPARATE thread (doing a 'fflush(stderr)' 
in termination thread)


    <IntInst::inititalize#175/this=I4/thread=T1><{tid=1}>
    Loop # [1], created thread [0000000000000210] for terminating instance ...

    
<IntInst::inititalize#175/this=I5/thread=T1><IntInst::terminate#529\this=I4\thread=T2>
    <{tid=1}>
    Loop # [2], created thread [000000000000022C] for terminating instance ...

    
<IntInst::inititalize#175/this=I6/thread=T1><IntInst::terminate#529\this=I5\thread=T3>
    <{tid=1}>
    Loop # [3], created thread [0000000000000218] for terminating instance ...

    
<IntInst::inititalize#175/this=I7/thread=T1><IntInst::terminate#529\this=I6\thread=T4>
    <{tid=1}>
    Loop # [4], created thread [0000000000000224] for terminating instance ...

    
<IntInst::inititalize#175/this=I8/thread=T1><IntInst::terminate#529\this=I7\thread=T5>
    <{tid=1}>
    Loop # [5], created thread [0000000000000258] for terminating instance ...

    ---
    *** main() - done: created [5] instances, ran 
[tid=.context~thread;.error~charOut('<{tid='tid'}>');], each terminated on 
SEPARATE threads
    ---
    <IntInst::terminate#529\this=I8\thread=*T6*>

    
<IntInst::inititalize#175/*this=I9/thread=T6*><IntInst::terminate#529\this=I9\thread=*T6*>
    *** main() - exiting testTerminate2.main() after Sleep(500); ***


The creation of the Rexx interpreter instances by the test program occurs on thread T1 on which the Rexx program gets carried out (producing the string "<{tid=1}>").

(case 1) creates an interpreter in the loop and terminates it on the same thread. Creating and terminating another interpreter instance thereafter works without a crash.

The difference between (case 2) which causes a crash and (case 3) seems to be the fact, that (case 2) shuts down with I5 on T2, but continues to create a new interpreter I6 on T1, terminates it on T3 and there the crash occurs (in the memoryObject.collectAndUninit() call branch).

(case 3) on the other hand does not shut down the interpreter, except after the last created interpreter I8 got terminated on T6, causing I9 on T6 to shut down, which works without a crash in this case.

Maybe these observation help shed light on what might be causing the crash.

---rony


On 23.12.2023 18:03, Rony G. Flatscher wrote:

Since last August a crash got reported in <https://sourceforge.net/p/oorexx/bugs/1872/> which can be easily reproduced by following the directions there. Using the supplied simple C++-only program will work flawlessly if invoked simply as

    testTerminate.exe

it will run flawlessly using 250 Rexx interpreter instances to execute a simple script (issuing a single dot on stdout) and then terminate it on the same thread.

... cut ...

_______________________________________________
Oorexx-devel mailing list
Oorexx-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oorexx-devel

Reply via email to