Hmm.

In my case, my "infinite loop" did not clear up overnight while I slept,
so I assumed it was infinite.

For those who do threaded programming: there are situations where there is
only a very small window in which a stuck loop can be unstuck.  So, for
example, something might be stuck in a loop, spinning, with a lock held
99.99% of the time, but there is a slim chance that some other thread
sneaks in during the 0.01% window, fixes whatever the condition is, the
loop gets unstuck, and the program eventually finishes without errors.
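A minimal sketch of that pattern (illustrative only, not OpenCog code; the
timings are made up):

```python
import threading
import time

lock = threading.Lock()
stuck = True

def spinner():
    """Spins with the lock held almost all of the time."""
    while True:
        with lock:
            if not stuck:        # some other thread fixed the condition
                return
            time.sleep(0.001)    # lock held ~99% of each iteration
        time.sleep(0.00001)      # the tiny window where the lock is free

def fixer():
    """Sneaks in during the window and clears the stuck condition."""
    global stuck
    with lock:
        stuck = False

t1 = threading.Thread(target=spinner)
t2 = threading.Thread(target=fixer)
t1.start(); t2.start()
t1.join(); t2.join()
print("loop got unstuck; program finishes without errors")
```

Whether this ever terminates depends entirely on the fixer thread winning
the race for the lock during that brief window.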

It's possible that that is the case here -- i.e. it eventually finishes, if
you wait long enough.  It's still a bug.  Seems like this is moving to the
top of my priority list now.

But first, I have to plow through my emails.

--linas


On Sat, Jun 3, 2017 at 11:42 AM, Curtis Faith <[email protected]>
wrote:

> No, I'm using Guile 2.2.2. Indeed I did get that hanging bug earlier,
> because the install originally went into /usr/local/lib/ instead of
> /usr/lib/, and I reported an issue on it
> <https://github.com/opencog/opencog/issues/2741> two weeks ago. All my
> tests were done with Guile 2.2.2, after I moved the library into /usr/lib/
> and got the configuration sorted out.
>
> The hang I found wasn't an infinite loop. It actually finished. The
> CogServer was spending all its time in the Guile GC, and when I came back
> on Monday, after letting it run all weekend, it was finished.
>
> On Sun, Jun 4, 2017 at 12:13 AM, Linas Vepstas <[email protected]>
> wrote:
>
>> Which version of guile are you using? You should be using guile-2.2.
>> Sooner or later, you will also need to build bdwgc with ./configure
>> --enable-large-config, or you will get the dreaded `Too many heap sections:
>> Increase MAXHINCR or MAX_HEAP_SECTS` error.
>>
>> Re the guile infinite loop: yes, there is a bug in guile; I plan to open
>> a bug report "real soon now".  I think you are describing this bug below.
>> What I see is that:
>> * the guile prompt stops responding
>> * the cpu spins at exactly 100% usage
>> * GDB shows a very small, very tight spin loop when I single-step; I get
>> this:
>>
>> single-stepping:
>>
>> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0,
>>     mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:975
>> 975            fp = SCM_FRAME_DYNAMIC_LINK (fp))
>> (gdb) print fp
>> $1 = (union scm_vm_stack_element *) 0x7f36183d5e38
>> (gdb) step
>> 973       for (fp = vp->fp, sp = vp->sp;
>> (gdb)
>> 979           for (slot = nlocals - 1; sp < fp; sp++, slot--)
>> (gdb)
>> 1011          slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp),
>> &cache);
>> (gdb)
>> 1006          sp = SCM_FRAME_PREVIOUS_SP (fp);
>> (gdb) print fp
>> $2 = (union scm_vm_stack_element *) 0x7f36183d5e38
>> (gdb) print sp
>> $3 = (union scm_vm_stack_element *) 0x7f36183d5e48
>> (gdb) step
>> 1011          slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp),
>> &cache);
>> (gdb)
>> find_slot_map (cache=0x7f361fe85ab0, ip=0x1) at ../../libguile/vm.c:932
>> 932       size_t slot = (((scm_t_uintptr) ip) >> 2) %
>> SLOT_MAP_CACHE_SIZE;
>> (gdb)
>> 935       if (cache->entries[slot].ip == ip)
>> (gdb)
>> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0,
>>     mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:1011
>> 1011          slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp),
>> &cache);
>> (gdb)
>> find_slot_map (cache=0x7f361fe85ab0, ip=0x1) at ../../libguile/vm.c:936
>> 936         map = cache->entries[slot].map;
>> (gdb)
>> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0,
>>     mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:975
>> 975            fp = SCM_FRAME_DYNAMIC_LINK (fp))
>> (gdb) print fp
>> $4 = (union scm_vm_stack_element *) 0x7f36183d5e38
>>
>> This also happens in older versions, e.g. guile-2.1.8.  It's a real bug;
>> I have not had time to tell the guile people about it.
>>
>> The weird thing is, everything was fine until about 2-3 weeks ago, when I
>> started seeing this bug. I was churning through huge quantities of text
>> before this, and it was not an issue. At all. And now it is. I can't
>> figure out why; as I say, it still hangs with older versions, and I can't
>> figure out what triggered this.
>>
>> --linas
>>
>> On Thu, Jun 1, 2017 at 11:09 PM, Curtis Faith <[email protected]>
>> wrote:
>>
>>> So my intuition about the Guile GC was right after all.
>>>
>>> Before leaving last night, I started a test of the 6,000-sentence file
>>> (the one that took two days last week and started me down this path,
>>> before I began working on AWS), and I came into the office this morning
>>> to check on it.
>>>
>>> For the new test last night, I reintroduced the (gc) call in
>>> observe-text but did not put back the fetch-atom and store-atom calls.
>>>
>>> I figured it would either finish quickly last night (if the SQL issue
>>> was the only major problem) or run into the same problem I encountered
>>> when I ran it last week (if something else, like the GC, was causing the
>>> slowdown). Either way, I would be able to separate the GC performance
>>> issues I suspected from the SQL performance issues I had already noted.
>>>
>>> When I arrived, the cogserver was using 100% CPU but was still running,
>>> processing sentence #590. I suspected the Guile GC, since one process
>>> was pegged at 100% on a 6-CPU machine where the cogserver usually runs
>>> at 1150% CPU. This matches the profile of one thread blocking all the
>>> others, which pointed at the GC. I fired up gdb, attached to the
>>> cogserver, and got a backtrace:
>>>
>>> (gdb) bt
>>> #0  0x00007f562938cfb2 in do_sigsuspend (set=0x7f56244ea2c0)
>>>     at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
>>> #1  __GI___sigsuspend (set=0x7f56244ea2c0)
>>>     at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
>>> #2  0x00007f56242ce832 in *GC_suspend_handler_inner* ()
>>>    from /usr/lib/x86_64-linux-gnu/libgc.so.1
>>> #3  0x00007f56242ce8a5 in *GC_suspend_handler* ()
>>>    from /usr/lib/x86_64-linux-gnu/libgc.so.1
>>> #4  <signal handler called>
>>> #5  0x00007f562941a93d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
>>> #6  0x00007f562944bbc4 in usleep (useconds=<optimized out>)
>>>     at ../sysdeps/unix/sysv/linux/usleep.c:32
>>> #7  0x00007f562a2d7fb7 in opencog::CogServer::serverLoop (this=0x19727f0)
>>>     at /home/inflector/opencog/opencog/opencog/cogserver/server/CogServer.cc:211
>>> #8  0x0000000000406b05 in main (argc=3, argv=0x7ffec4633ae8)
>>>     at /home/inflector/opencog/opencog/opencog/cogserver/server/CogServerMain.cc:232
>>> (gdb)
>>>
>>> This is the code that the garbage collector invokes to suspend threads
>>> while it runs. I've been digging into the collector's code and found
>>> several potential interactions that would produce what I've been seeing.
>>> Essentially, as the number of objects in the cogserver heap grows, the
>>> conservative GC has to scan a larger and larger space, until eventually
>>> this scan consumes a single core's entire processing capability. As the
>>> search space grows larger and larger, this blocking begins to dominate
>>> all execution.
>>>
>>> You can see this in my notes in the attached observe-text.log file,
>>> where I compare GC time against total time: 320 seconds out of 463
>>> seconds for my 5-chapter test file, i.e. 69% of the processing, for only
>>> 303 sentences. And this percentage grows with successive runs over the
>>> same file. These successive runs create no new atoms, but they show a
>>> leak of 0.130 GB per run with the (gc) call in observe-text, and 0.400
>>> GB per run over the 5 chapters without the (gc) call inside
>>> observe-text.
>>>
>>> Currently the job has been sitting for 12 hours on one single sentence,
>>> while the garbage collector runs repeatedly, attempting to complete its
>>> work. Since I encountered this last week, let it run over the weekend,
>>> and found it finished when I came back on Monday, I suspect the GC is
>>> still churning away, being invoked over and over, and at some point it
>>> will make enough progress to process the sentence. Completing it will
>>> then trigger another internal heap allocation, which the collector
>>> always makes at 200% of the previous size; that will cause the GC to be
>>> invoked much less often for some period of time, so the rest of the
>>> sentences will process much faster. But at some heap size the GC stops
>>> working, as the processing time required saturates a single CPU core.
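The way GC can come to dominate runtime can be sketched with a toy model
(the throughput numbers below are invented for illustration; they are not
measured Guile or bdwgc figures):

```python
def gc_share(live_mb, alloc_mb_between_gcs=64,
             scan_mb_per_s=500.0, mutator_mb_per_s=200.0):
    """Fraction of wall time spent in GC, assuming each stop-the-world
    collection must scan the entire live heap."""
    gc_time = live_mb / scan_mb_per_s                    # one full mark pass
    mutator_time = alloc_mb_between_gcs / mutator_mb_per_s
    return gc_time / (gc_time + mutator_time)

for live in (256, 1024, 4096, 16384):
    print(f"{live:6d} MB live -> {gc_share(live):.0%} of time in GC")
# prints 62%, 86%, 96%, 99%
```

Because each collection scans the whole live heap while the useful work
between collections stays roughly constant, the GC's share of wall time
creeps toward 100% as the heap grows, matching the 69%-and-rising figure
above.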
>>>
>>> All the above means Guile's current GC won't work for AtomSpace. Not
>>> without some significant work. That's the bad news.
>>>
>>> The good news is that I believe we can implement a fix for our special
>>> case of Guile running in a separate process in the CogServer in a few
>>> days. My idea is a bit of a hack, but it will give extremely fast
>>> allocations, zero leaks, and speed up the code by at least 2x over the
>>> small-file case, where we're already getting sub-second processing of
>>> lines. I'd not be surprised if we can hit 0.5 seconds in the case with
>>> no SQL saving.
>>>
>>> I have attached a screenshot of top, where you can see 100% CPU for the
>>> cogserver, and Terminal windows showing that there were no errors
>>> starting the cogserver and that it is on sentence 590, which took 1049
>>> seconds to reach, with an average per-sentence time at that point of
>>> 1.78 seconds. You can also see that there are no atoms in the atomspace
>>> (except for the default one that gets added to every atomspace).
>>>
>>> So my plan is: as a first step, I'm going to implement a quick-and-dirty
>>> replacement for the GC which just uses malloc, to see if that gets
>>> through the file. If this approach works, it will confirm the above,
>>> though it won't be suitable for the long term.
>>>
>>> Assuming the malloc-based allocations complete in reasonable time, I've
>>> got some ideas for altering the GC code to work with a separate
>>> per-thread heap for new allocations, which we can then dispose of en
>>> masse when the thread terminates. Assuming we can get a decent bound on
>>> the usage of guile-internal objects during statement execution, this
>>> will be much, much faster and won't leak. Everyone wins.
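The per-thread-heap idea in miniature (a hypothetical design sketch, not an
existing Guile or bdwgc API): allocations go into a thread-local arena, and
the whole arena is disposed of en masse when the unit of work ends, so the
global collector never has to trace those objects.

```python
class Arena:
    """Toy per-thread arena: O(1) allocation, bulk disposal."""
    def __init__(self):
        self._objects = []

    def alloc(self, obj):
        self._objects.append(obj)   # no per-object GC bookkeeping
        return obj

    def dispose(self):
        """Free everything allocated in this arena in one shot."""
        n = len(self._objects)
        self._objects.clear()
        return n

arena = Arena()
for i in range(1000):
    arena.alloc(("temp-cell", i))   # stand-in for short-lived per-sentence objects
print("freed", arena.dispose(), "objects en masse")
```

The design trade-off is the one stated above: it only works if guile-internal
objects allocated during statement execution never outlive the statement.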
>>>
>>> Let me know if you have any questions or suggestions.
>>>
>>> - Curtis
>>>
>>> On Thu, Jun 1, 2017 at 1:42 AM, Ben Goertzel <[email protected]> wrote:
>>>
>>>> Thanks a lot Curtis!
>>>>
>>>> I totally grok the desire to complete an analysis before announcing
>>>> results... OTOH if we were all sitting in the same office we'd
>>>> probably be communicating lots of interim results.  So generally I'm
>>>> biased toward fostering communication! ...
>>>>
>>>> See you Monday...
>>>>
>>>> ben
>>>>
>>>> On Wed, May 31, 2017 at 7:58 AM, Curtis Faith <[email protected]>
>>>> wrote:
>>>> > I should have been more explicit and said that removing the store and
>>>> > fetch was just a quick and dirty way of measuring the performance
>>>> > impact of this issue, not a suggested solution to it.
>>>> >
>>>> > On Wed, May 31, 2017 at 10:13 PM, Curtis Faith <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> I am sending this a bit earlier than I would normally, as I am still
>>>> >> trying to figure out the contribution of the Guile garbage collector
>>>> >> to performance and am not quite done. I have also not yet measured a
>>>> >> test over significant data for a before-and-after comparison, for
>>>> >> example.
>>>> >>
>>>> >> But since Ben sent Linas an email about this problem, I'll send my
>>>> >> work in progress, so Linas has a bit more information to answer Ben's
>>>> >> question.
>>>> >>
>>>> >> So, the following represents a partial analysis of the SQL issues I
>>>> found.
>>>> >> I removed the code for fetch-atom and store-atom, and while this is
>>>> >> considerably faster, there still seem to be many-second pauses in
>>>> >> processing that are not related to the number of words in the
>>>> >> sentence in any way I can easily correlate. The Guile garbage
>>>> >> collector? This might be due to a memory leak, which would result in
>>>> >> the garbage collector having to search over an ever-increasing memory
>>>> >> space, looking for object-ish references using its conservative
>>>> >> scheme.
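What "object-ish references" means under a conservative scheme can be shown
with a toy sketch (this is an illustration of the general idea, not bdwgc's
actual implementation; the address range is made up): the collector cannot
tell pointers from integers, so any word whose value falls inside the heap's
address range is treated as a possible reference and keeps its target alive.

```python
# Hypothetical heap address range for the sketch.
HEAP_BASE, HEAP_END = 0x1000, 0x9000

def conservative_roots(words):
    """Words that *look like* heap pointers; a conservative collector
    must treat every one of these as a live reference."""
    return [w for w in words if HEAP_BASE <= w < HEAP_END]

# A pretend thread stack: two plausible pointers mixed with plain integers.
stack_words = [0x2a, 0x1234, 0xdeadbeef, 0x8ff0, 7]
print([hex(w) for w in conservative_roots(stack_words)])
```

The cost of this scan grows with the amount of memory that must be examined,
which is why a leak makes the collector progressively slower.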
>>>> >>
>>>> >> - Curtis
>>>> >>
>>>> >> -----------------------
>>>> >>
>>>> >> My 6,000-sentence test file finished over the weekend. It took about
>>>> >> 48 hours.
>>>> >>
>>>> >> For some context, the input corpus, Pride and Prejudice, has
>>>> >> approximately:
>>>> >>
>>>> >> 120 K words
>>>> >> 700 K bytes
>>>> >> 6 K sentences
>>>> >> 2 K paragraphs
>>>> >> 25 average words per sentence
>>>> >>
>>>> >> So it is a fair-sized novel, but only a tiny, tiny fraction of what
>>>> >> we'd like to feed the language-learning system.
>>>> >>
>>>> >> In looking at the Postgres statistics for updates on a freshly
>>>> >> initialized database, I found:
>>>> >>
>>>> >> SELECT relname, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del
>>>> >> FROM pg_stat_user_tables;
>>>> >>
>>>> >>
>>>> >>   relname   | idx_scan  | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
>>>> >> ------------+-----------+---------------+-----------+-----------+-----------
>>>> >>  atoms      |  95789119 |      90072278 |   3288742 |         0 |         0
>>>> >>  values     |         4 |             0 |         0 |         0 |         0
>>>> >>  typecodes  |         0 |             0 |       228 |         0 |         0
>>>> >>  valuations | 131581719 |      84520666 |  44519326 |         0 |  41926158
>>>> >>  spaces     |   3288742 |       3288742 |         2 |         0 |         0
>>>> >> (5 rows)
>>>> >>
>>>> >> So roughly 90M fetches for Atoms and 85M for Valuations while parsing
>>>> >> a single file of 6K sentences.
>>>> >>
>>>> >> We have 3.3M atoms inserted and 90.0M index tuple fetches for the
>>>> >> atoms table.
>>>> >>
>>>> >> In digging into the code, I found we have an explicit delete and
>>>> >> insert for every single increment of every atom count maintained by
>>>> >> the system. The count-increment code is also recursive over the trees
>>>> >> defined by each pair, which in practice involves many different
>>>> >> atoms: even though we only increment the count for one atom in the
>>>> >> tree, the store-atom and fetch-atom calls update the truth values for
>>>> >> every atom in the tree that has a truth value. The pair-counting code
>>>> >> has this comment:
>>>> >>
>>>> >> ; ---------------------------------------------------------------------
>>>> >> ; update-pair-counts -- count occurrences of word-pairs in a parse.
>>>> >> ;
>>>> >> ; Note that this might throw an exception...
>>>> >> ;
>>>> >> ; The structures that get created and incremented are of the form
>>>> >> ;
>>>> >> ;     EvaluationLink
>>>> >> ;         PredicateNode "*-Sentence Word Pair-*"
>>>> >> ;         ListLink
>>>> >> ;             WordNode "lefty"  -- or whatever words these are.
>>>> >> ;             WordNode "righty"
>>>> >> ;
>>>> >> ;     ExecutionLink
>>>> >> ;         SchemaNode "*-Pair Distance-*"
>>>> >> ;         ListLink
>>>> >> ;             WordNode "lefty"
>>>> >> ;             WordNode "righty"
>>>> >> ;         NumberNode 3
>>>> >>
>>>> >>
>>>> >> So every pair has 11 atoms.
>>>> >>
>>>> >> The truth value gets set only at the top level, but the fetch and
>>>> >> store end up retrieving and storing the valuations for the entire
>>>> >> tree wherever there are truth values.
>>>> >>
>>>> >> Further, the counting code is storing and fetching the truth values
>>>> >> for each new count value each time it gets incremented. So we'll have
>>>> >> (1: an insert for 1), then (2: a delete of the 1 and an insert for
>>>> >> 2), then (3: a delete of the 2 and an insert for 3), (4: a delete of
>>>> >> the 3 and an insert for 4), etc. And this is done for the
>>>> >> EvaluationLink, the ExecutionLink, and the four WordNodes in the
>>>> >> example above.
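The churn described above can be reproduced in miniature (the one-table
schema below is a made-up stand-in, not the actual AtomSpace SQL layout):
modeling each increment as a DELETE of the old row plus an INSERT of the new
one gives 2N-1 row operations for N increments, versus N cheap in-place
UPDATEs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE valuations (atom TEXT, count INTEGER)")

def increment(atom, times):
    """Count the row operations needed to reach `times` via delete+insert."""
    ops = 0
    for new_count in range(1, times + 1):
        if new_count > 1:
            # delete the row holding the previous count...
            conn.execute("DELETE FROM valuations WHERE atom = ?", (atom,))
            ops += 1
        # ...then insert a fresh row with the new count
        conn.execute("INSERT INTO valuations VALUES (?, ?)", (atom, new_count))
        ops += 1
    return ops

ops = increment("word-pair", 10)
print(ops, "row operations for 10 increments")   # 19 row operations for 10 increments
```

Multiply that by the six atoms touched per pair and the number of pairs per
sentence, and the statement counts below stop being surprising.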
>>>> >>
>>>> >> I wanted to verify this assumption, so I created a fresh database and
>>>> >> turned on statement logging in Postgres. It resulted in the attached
>>>> >> log file for a test of a four-word sentence:
>>>> >>
>>>> >> (observe-text "This is a test.")
>>>> >>
>>>> >> We got 5419 lines in the log file (not counting the 333 lines of
>>>> >> startup creations of types etc.), each of which represents a single
>>>> >> SQL statement. There are:
>>>> >>
>>>> >> 813 BEGIN,
>>>> >> 813 COMMIT,
>>>> >> 2127 SELECT,
>>>> >> 720 DELETE, and
>>>> >> 946 INSERT
>>>> >>
>>>> >> statements for a four-word sentence. (See the attached text log for
>>>> >> details.)
>>>> >>
>>>> >> There are 133 new atoms created and 93 valuations after the call to
>>>> >> observe-text is finished.
>>>> >>
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Ben Goertzel, PhD
>>>> http://goertzel.org
>>>>
>>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
>>>> boundary, I am the peak." -- Alexander Scriabin
>>>>
>>>
>>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAHrUA35vumixmCvgDkGxGY8WetE8rX-WAEgEVQqCF5DymwuMTw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
