On Fri, Jun 30, 2017 at 7:08 AM, Moritz Hoffmann <antig...@gmail.com> wrote:
> Hi P.O.,
> I think the optimizations suggested by others are very relevant to your
> use case. Looking at your you perform many operations that are linear and
> are performed for every input word so roughly quadratic in runtime. Here
> are some examples:
>
> Position = LeftWordsMB~WordPos(LeftWordMB)
>
> The runtime of WordPos is linear in the length of the receiver object,
> i.e. LeftWordMB (which is a string).
>
> LeftWordsMB = LeftWordsMB LeftWordMB
>
> This constructs a new object out of LeftWordsMB and LeftWordMB. It
> allocates memory and copies data, so its time complexity roughly linear in
> the length of the string being combined.
>
> Both operations are performed for every input line(!), which leads to an
> approximately quadratic runtime (actually lines*words, but good enough for
> our analysis). Processing a million lines with a quadratic algorithm can be
> expected to timeout and given you're allocating memory in a loop, also
> exhaust Rexx's heap.
>
> Note that ooRexx can only trigger garbage collection cycles at very
> specific places in the code. I'm not sure which they are right out of my
> head but sometimes tight loops with memory allocation can prevent the GC
> from running. You should make use of better data structured to reduce the
> run-time complexity of your algorithm, then it'll be much easier to help
> you. See the suggestions from the other thread, i.e. use hashing to lookup
> counts.
>
That's not true. A GC operation is triggered any time the memory manager is
unable satisfy a request. When that happens, all of the dead objects are
swept up and the allocation request is reattempted. If it is still unable
to allocate the requested object, then the heap is expanded so that the
object can be allocated. There are certainly some code patterns that can
cause heap expansion. The biggest offenders are using concatenations like
LeftWordsMD = LeftWordsMD LeftWordMB
when the strings involved are quite long. When that happens, it is very
likely that there will not be a block of heap storage large enough to
satisfy the request. A GC request is triggered, the allocation fails, and
the heap is expanded. Once the strings grow really big, this is likely to
happen on every iteration of the loop as progressively larger blocks of
storage need to allocated. A GC operation can be very expensive,
particularly if the program has a large number of live objects.
The MutableBuffer class can fix these sorts of problems, particularly if
you can predict in advance the size you might need. The object is allocated
once and data is just added to it, so the need for progressively larger
chunks of storage goes away.
Rick
>
> Moritz
>
>
> On Fri, Jun 30, 2017 at 12:15 PM, P.O. Jonsson <oor...@jonases.se> wrote:
>
>> Dear all,
>>
>> Just for the record: I did NOT ask anyone to *debug* my code, I
>> volunteered a test case that could systematically exhaust the memory of my
>> machine for a certain configuration whereas for all other data it performed
>> well. Had I desired to be ridiculed I would have turned to Facebook or
>> Twitter :-(
>>
>> I have received constructive feedback from Rick, Erich and others, I know
>> what I have to do, but again, that was not the point with my posting; for
>> that I would post to oorexx-users, and I have done so earlier so all that
>> is in the pipeline.
>>
>> What I have made is just a proof-of-concept, not a candidate for Miss
>> Universe. In its crappy state it is nevertheless producing the correct data
>> output, and the program has ben run some 50 000 times in the last two weeks
>> or so, and only in a few cases (five maybe) do I have the problem I have
>> described here. I have harvested more useful translation data with this
>> program than dict.cc (for EN -> FR and FR -> DE at least) can provide
>> and that using a single corpus, the DCEP.
>>
>> @Erich,
>>
>> If you are still interested in seeing the behavior I have made a
>> monolithic test case that should run on any platform. I have tested it om
>> Mac OS, Windows 7 and Windows 10 and the behavior is the same in all
>> places.
>>
>> Testcases 1-4 run just fine, with the Rexx process starting nicely at
>> 40-50 MB and going up to a maximum of 100 MB before exiting.
>> Testcase 5, the problem case, quickly start to allocate memory and
>> continue to do so until there is none left.
>>
>> I am still puzzled how the same program can allocate 1000 times more
>> memory for the same processing of different data.
>>
>> Download from my Dropbox folder, there is no need to sign in or anything,
>> just download & unpack everything in a folder where you want it.
>>
>> https://www.dropbox.com/sh/vettlcb4f8ae3cw/AACWIQivo_F2KhhytJ6izkbFa?dl=0
>>
>>
>> Hälsningar/Regards/Grüsse,
>> P.O. Jonsson
>> oor...@jonases.se
>>
>>
>>
>>
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Oorexx-devel mailing list
>> Oorexx-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/oorexx-devel
>>
>>
>
>
> --
> Moritz Hoffmann;
> http://antiguru.de/
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Oorexx-devel mailing list
> Oorexx-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oorexx-devel
>
>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Oorexx-devel mailing list
Oorexx-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oorexx-devel