Not to assert in any way that ooRexx has no scaling issues, but when I hear hoofbeats I think "horses", not "reindeer". ;-) I would want to be very sure that the long execution times and random anomalies were not caused by my code before asking volunteers to debug it.

-Chip-

On 6/29/2017 11:11 AM, Erich Steinböck wrote:
Hi P.O.,

    executing other people's code .. please give it a try

I can run code if you provide it as a platform-independent test case. I have no Mac, and the Mac binaries you provide won't run on Ubuntu or Windows, the platforms I can test on.

    tr.rex gets stuck in the routine split_data

The code in tr.rex is inefficient when applied to a large data set, which likely explains the very long run-time and high memory consumption you experience. Let me give you an example of the gains that may be achieved by coding things differently. Using string append and wordPos(), this code takes about a minute to execute for n = 100000:

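n = 100000
call time "r"                  -- reset the elapsed-time clock
call random , , 42             -- fixed seed for repeatable numbers
rs = ""                        -- blank-separated list of values seen so far
do n
   r = random(1, n % 2)
   if rs~wordPos(r) = 0 then do   -- linear scan of an ever-growing string
     rs = rs r
     stem.r = r
   end
   else
     stem.r = stem.r r
end
say time("e")~format(, 2) "sec"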

Achieving something very similar using StringTable and Array runs in about a tenth of a second for the same n:

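n = 100000
call time "r"                  -- reset the elapsed-time clock
call random , , 42             -- fixed seed for repeatable numbers
table = .StringTable~new(n)
do n
   r = random(1, n % 2)
   if \table~hasIndex(r) then     -- index lookup instead of a string scan
     table[r] = .Array~of(r)
   else
     table[r]~append(r)
end
say time("e")~format(, 2) "sec"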

A change like this gives a 600-fold improvement, and the numbers you are working with are much larger than 100000.


I also noticed that the DE-EN-Cleaned.txt you provide contains more than 56% duplicate lines; removing them might also bring some improvement.
The gains are getting




On Wed, Jun 28, 2017 at 11:46 PM, P.O. Jonsson <oor...@jonases.se <mailto:oor...@jonases.se>> wrote:


    Hälsningar/Regards/Grüsse,
    P.O. Jonsson
    oor...@jonases.se <mailto:oor...@jonases.se>

    Hello again Erich,

    I know executing other people's code can be a p.i.t.a. but please
    give it a try. See it as a golden opportunity to stress test ooRexx :-)

    Am 28.06.2017 um 21:21 schrieb Erich Steinböck
    <erich.steinbo...@gmail.com <mailto:erich.steinbo...@gmail.com>>:

        Please download the complete test set and let it run and

    I neither have a Mac nor do I have 50 GB of memory

    I can share my machine over remote logon if that would help or we
    can try to look at it using a shared screen. You do not need much
    memory to run the program, 5 GB is more than sufficient for ONE
    instance of the program, and that is enough to simulate the problem.


        I had a REPRODUCIBLE scenario where this problem occurs
        Out of 1200 or so runs it was only this single run that
        produced memory bloating

        see if you can reproduce the memory problem

    Can you explain the problem in more detail? What exactly happens
    when you run which command with what arguments? What are you
    expecting to happen instead and why?

    The problem is that the program tr.rex gets stuck in the routine
    split_data (in the main loop, when I break it) or in sort_data
    (presumably on the ~StableSort) for 1000 times longer with certain
    intermediate data (read below) than with other data. These runs do
    not have so much more data than the others that I would expect this
    memory load. While the program is in one of these routines, the
    memory allocation for the rexx process goes up and up until no
    memory is left (and the machine starts to swap). At the beginning
    the memory allocated to the rexx process is negligible, so you can
    try it on a machine with any amount of memory.

        it finished in 7 hours 1200 individual ooRexx processes

    What does "1200 individual ooRexx processes" mean? Are you
    starting your program with 1200 different sets of arguments?
    Sequentially or in parallel? Which one of the programs shows the
    issue? Is it always the same one?

    In order to use all cores/threads on my machine I use a bash shell
    script to launch/spawn up to at most 24 instances of the same
    program in parallel, running on the *same* data but with different
    parameters, producing *different* intermediate data files (_RAW
    files) that are read and processed in Split_data and handed over
    to Sort_data. When one chunk of data is processed that process
    finishes (tr.rex exits) and another one is started to do the same
    over and over again up to around 1200 individual runs for one
    batch. There is only one rexx program and the problem only arises
    for specific parameters in combination with specific input data. I
    have provided you two examples, one that runs like a charm and
    another one that never finishes.

        why is the interpreter not warning me when I overwrite an
        object with a string?

    You're not overwriting an object with a string, you're changing
    a variable from referring to one object to referring to another
    one.  That's totally normal .. similar to coding a = 1; a = 2;

    I don't think this is normal but never mind, I never liked objects
    anyway :-) When I started using Rexx the credo was  „Everything is
    a string“. And I am still in the habit of programming like that,
    hence the code you see before you.

    In the past (4.1, 4.2?), if I did a /say myMutableBuffer/ it
    reported „A Mutable Buffer“ or something; nowadays I get the value
    stored in the MB. Is there a way to check what kind of object you
    are referring to, say a ~whatAreYou method? That would be useful
    when you look for mistakes in your code (I occasionally write
    imperfect code, unfortunately).
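
One way to answer the "what kind of object is this" question above, as a minimal sketch: every ooRexx object responds to the class message, and isA() tests for membership in a class. The variable name thing is only an illustration and does not come from tr.rex.

```rexx
-- Sketch only: "thing" is a made-up variable name
thing = .MutableBuffer~new("abc")
say thing~class~id               -- name of the class of the referenced object
say thing~isA(.MutableBuffer)    -- 1 if the object is a MutableBuffer, else 0

thing = "abc"                    -- the variable now refers to a String instead
say thing~class~id
say thing~isA(.MutableBuffer)
```

Printing the class name next to a suspicious variable is a quick check when a variable ending in MB may have been reassigned to a plain String.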


    On Tue, Jun 27, 2017 at 10:26 PM, P.O. Jonsson
    <oor...@jonases.se <mailto:oor...@jonases.se>> wrote:

        "maybe it is just bad programming"

        I guess I had it coming…

        Thanks Erich for your advice, I will consider it all, but my
        intention with this report was another one; for the first
        time I had a REPRODUCIBLE scenario where this problem
        occurs. Out of 1200 or so runs it was only this single run
        that produced memory bloating, so my assumption was that it
        was not ONLY :-) bad programming.

        Please download the complete test set and let it run and see
        if you can reproduce the memory problem I have. If so it is
        easy for you to just improve the code and see where the
        problem goes away. I have a feeling I am stuck at

        a = a~StableSort

        for quite some time, maybe because of unfavorable data, but
        I can't tell for sure.

        PS I had the program run again overnight, it finished in 7
        hours 1200 individual ooRexx processes  with no problem. In
        another run I am now at 53 GB in a single process running at
        100% CPU for 10 hours.

        Question on Mutable Buffers (there is a lot of *NEW* there):
        I understand I need to ~append or ~insert for the MB but why
        is the interpreter not warning me when I overwrite an object
        with a string? Why is that not an error? Is there a reason
        why it should be allowed to destroy an object like I did?

        Hälsningar/Regards/Grüsse,
        P.O. Jonsson
        oor...@jonases.se <mailto:oor...@jonases.se>




        Am 27.06.2017 um 17:15 schrieb Erich Steinböck
        <erich.steinbo...@gmail.com
        <mailto:erich.steinbo...@gmail.com>>:

            maybe it is just bad programming

        Hi P.O.,
        I had a look at Split_data and as far as I can see there
        are a lot of things which can be improved.

        1)

        You may want to re-read how to work with a MutableBuffer, e.g.

          tempMB          = .mutablebuffer~new('')
          do while ..
            tempMB = qfileIn~linein

        Initializing a variable with a MutableBuffer instance, and
        afterwards assigning it a String (linein() returns a
        String) doesn't make sense.

        I can see quite a few instances of this, e.g.

          TranslatedMB    = .mutablebuffer~new('')
          do while ..
            DO i=1 TO i_End
              DO j=1 TO j_End

                  TranslatedMB = TranslatedMB TranslateWordMB

        Again, the final TranslatedMB assignment is not what the
        ..MB ending of the variables suggests.
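
To illustrate point 1, here is a minimal sketch of keeping a MutableBuffer as a buffer by appending to it rather than assigning a String to the variable. The sample records are made up and stand in for the lines that tr.rex reads with qfileIn~linein.

```rexx
-- Sketch only: "records" stands in for the lines read from the input file
records = .Array~of("eins one", "zwei two", "drei three")

buffer = .MutableBuffer~new          -- create the buffer once
do record over records
   -- append() changes the buffer in place, so the variable keeps
   -- referring to the same MutableBuffer object
   if buffer~length > 0 then buffer~append(" ")
   buffer~append(record)
end

say buffer~class~id                  -- class name of the referenced object
say buffer~string                    -- contents as an ordinary String
```

An assignment such as buffer = buffer record would instead make the variable refer to a newly built String, which is the pattern criticized above.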

        2)

        You might move invariant expressions (here: LeftWordsMB~Word(i)
        || '-') out of the inner loop, e.g.

              DO j=1 TO j_End
                TranslateWordMB = LeftWordsMB~Word(i) || '-' || RightWordsMB~Word(j)
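
A minimal sketch of point 2, assuming two short word lists in place of LeftWordsMB and RightWordsMB (the sample words and the names prefix and pairs are made up): the part that depends only on i is computed once per outer iteration.

```rexx
-- Sketch only: sample word lists standing in for LeftWordsMB and RightWordsMB
leftWords  = "haus baum"
rightWords = "house tree dog"

pairs = .Array~new
do i = 1 to leftWords~words
   prefix = leftWords~word(i) || "-"     -- does not change in the inner loop
   do j = 1 to rightWords~words
      pairs~append(prefix || rightWords~word(j))
   end
end

say pairs~items "pairs, first is" pairs[1]
```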


        3)

        Consider using a single startsWith() instead of the
        code between lines 448 and 485.

        4)

                IF TranslatedMB~WordPos(TranslateWordMB) > 0 THEN
                ..
                ELSE
                DO
                  TranslatedMB = TranslatedMB TranslateWordMB

        Instead of building a long string of all things seen
        before and checking it with wordPos(), you might put
        all things seen into a Set and check with hasIndex().
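
A minimal sketch of point 4, using made-up word pairs in place of the TranslateWordMB values (the names seen and unique are also only illustrative): membership is tested with hasIndex() on a Set, so the cost of each check does not grow with a string that has to be scanned.

```rexx
-- Sketch only: "pairs" stands in for the word pairs built in the inner loop
pairs = .Array~of("haus-house", "baum-tree", "haus-house", "hund-dog")

seen   = .Set~new                    -- every pair seen so far
unique = .Array~new                  -- pairs in first-seen order
do pair over pairs
   if \seen~hasIndex(pair) then do   -- Set lookup instead of a wordPos() scan
      seen~put(pair)
      unique~append(pair)
   end
end

say seen~items "distinct pairs out of" pairs~items
```

The separate Array is only needed if the first-seen order matters, since a Set does not keep its items in insertion order.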

        5)

        Generally, using Arrays may be more efficient if that lets
        you avoid the Stem.0 handling.
        But then, using the proper type of Collection and an
        appropriate algorithm may help much more.
        To give suggestions for that, I'd need more detail on
        what exactly you would like to achieve.

        On Tue, Jun 27, 2017 at 7:55 AM, P.O. Jonsson
        <oor...@jonases.se <mailto:oor...@jonases.se>> wrote:

            Dear developers,

            I have had the memory bloating problem again, this time
            I reached 48 GB (the maximum for one CPU in my machine)
            and the process only ended after some 13 CPU hours with
            100% CPU the whole time.




            From the logging info I could confirm that the program
            was stuck somewhere here most of the time, here are the
            rough steps

            Language pairs detected in C routine -> External call, no memory bloating
            Data processing finished after 2107 Seconds 00:58:12
            Splitting finished after 49487 Seconds 14:42:59 -> Routine Split_data
            Sorting finished after 16527 Seconds 19:18:27 -> Routine Sort_data
            Processing of Data file finished after 68123 Seconds
            Writing the Logfile TR_DE-EN-eu_logfile.txt 26 Jun 2017 19:18:28

            I have enclosed the Routines in question.

            In my dropbox I have stored the complete program with
            some test data to replicate the processing, the problem
            is reproducible. Just put the folder somewhere, move
            there and perform the command indicated.

            
            https://www.dropbox.com/sh/vettlcb4f8ae3cw/AACWIQivo_F2KhhytJ6izkbFa?dl=0

            I run Open Object Rexx Version 5.0.0, Build date: May
            20 2017, Addressing mode: 64
            Hardware Mac Pro with dual-CPU Xeon Processors running
            Mac OS Sierra 10.12.5

            PS as I was making the screenshot the process finished
            nicely, no crash or anything and the memory was
            released. So maybe it is just bad programming, but at
            least you can confirm that then :-)




            Hälsningar/Regards/Grüsse,
            P.O. Jonsson
            oor...@jonases.se <mailto:oor...@jonases.se>





            
------------------------------------------------------------------------------
            Check out the vibrant tech community on one of the
            world's most
            engaging tech sites, Slashdot.org
            <http://slashdot.org/>! http://sdm.link/slashdot
            _______________________________________________
            Oorexx-devel mailing list
            Oorexx-devel@lists.sourceforge.net
            <mailto:Oorexx-devel@lists.sourceforge.net>
            https://lists.sourceforge.net/lists/listinfo/oorexx-devel
            <https://lists.sourceforge.net/lists/listinfo/oorexx-devel>


        