No, I'm using Guile 2.2.2. Indeed I did get that hanging bug earlier, because the install originally went into /usr/local/lib/ instead of /usr/lib/, and I reported an issue on it <https://github.com/opencog/opencog/issues/2741> two weeks ago. All my tests were done with Guile 2.2.2 after I got the configuration sorted out and moved the library into /usr/lib/.
The hang I found wasn't an infinite loop. It actually finished. The CogServer was spending all its time in the Guile GC, and when I came back on Monday, after letting it run all weekend, it was finished.

On Sun, Jun 4, 2017 at 12:13 AM, Linas Vepstas <[email protected]> wrote:

> Which version of guile are you using? You should be using guile-2.2. Sooner or later, you also need to set up bdwgc with the ./configure --enable-large-config or you will get the dreaded `Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS` error.
>
> Re guile inf loop. Yes, there is a bug in guile; I plan to open a bug report "real soon now". I think you are describing this bug below. What I see is that:
> * the guile prompt stops responding
> * cpu spins at exactly 100% cpu usage.
> * GDB shows a very small, very tight spin loop when I single-step; I get this:
>
> single-stepping:
>
> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:975
> 975         fp = SCM_FRAME_DYNAMIC_LINK (fp))
> (gdb) print fp
> $1 = (union scm_vm_stack_element *) 0x7f36183d5e38
> (gdb) step
> 973         for (fp = vp->fp, sp = vp->sp;
> (gdb)
> 979         for (slot = nlocals - 1; sp < fp; sp++, slot--)
> (gdb)
> 1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
> (gdb)
> 1006        sp = SCM_FRAME_PREVIOUS_SP (fp);
> (gdb) print fp
> $2 = (union scm_vm_stack_element *) 0x7f36183d5e38
> (gdb) print sp
> $3 = (union scm_vm_stack_element *) 0x7f36183d5e48
> (gdb) step
> 1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
> (gdb)
> find_slot_map (cache=0x7f361fe85ab0, ip=0x1) at ../../libguile/vm.c:932
> 932         size_t slot = (((scm_t_uintptr) ip) >> 2) % SLOT_MAP_CACHE_SIZE;
> (gdb)
> 935         if (cache->entries[slot].ip == ip)
> (gdb)
> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:1011
> 1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
> (gdb)
> find_slot_map (cache=0x7f361fe85ab0, ip=0x1) at ../../libguile/vm.c:936
> 936         map = cache->entries[slot].map;
> (gdb)
> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:975
> 975         fp = SCM_FRAME_DYNAMIC_LINK (fp))
> (gdb) print fp
> $4 = (union scm_vm_stack_element *) 0x7f36183d5e38
>
> This also happens in older versions, e.g. guile-2.1.8. It's a real bug; I have not had time to tell the guile people about it.
>
> The weird thing is, everything was fine until about 2-3 weeks ago, when I started seeing this bug. I was churning through huge quantities of text before this, and it was not an issue. At all. And now it is. I can't figure out why. As I say, it still hangs with older versions, and I can't figure out what triggered this.
>
> --linas
>
> On Thu, Jun 1, 2017 at 11:09 PM, Curtis Faith <[email protected]> wrote:
>
>> So my intuition about the Guile GC was right after all.
>>
>> I came into the office this morning after leaving a test running last night on the 6,000-sentence file, the one that took two days last week and started me down this path before the AWS work.
>>
>> For the new test last night, I reintroduced the (gc) call in observe-text but did not put back the fetch-atom and store-atom calls.
>>
>> I figured it would either finish quickly last night (if the SQL issue was the only major problem) or run into the same problem I encountered when I ran it last week (if something else, like the GC, was causing the slowdown). Either way I would be able to separate the GC performance issues I suspected from the SQL performance issues I had already noted.
>>
>> When I arrived, the cogserver was still running, using 100% CPU, and processing sentence #590. I suspected the Guile GC, since one process was pegged at 100% on a 6-CPU machine where the cogserver usually runs at 1150%.
>> This matches the profile of one thread blocking all the others, which pointed to the GC. I fired up gdb, attached to the cogserver, and did a backtrace:
>>
>> (gdb) bt
>> #0  0x00007f562938cfb2 in do_sigsuspend (set=0x7f56244ea2c0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
>> #1  __GI___sigsuspend (set=0x7f56244ea2c0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
>> #2  0x00007f56242ce832 in GC_suspend_handler_inner () from /usr/lib/x86_64-linux-gnu/libgc.so.1
>> #3  0x00007f56242ce8a5 in GC_suspend_handler () from /usr/lib/x86_64-linux-gnu/libgc.so.1
>> #4  <signal handler called>
>> #5  0x00007f562941a93d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
>> #6  0x00007f562944bbc4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
>> #7  0x00007f562a2d7fb7 in opencog::CogServer::serverLoop (this=0x19727f0) at /home/inflector/opencog/opencog/opencog/cogserver/server/CogServer.cc:211
>> #8  0x0000000000406b05 in main (argc=3, argv=0x7ffec4633ae8) at /home/inflector/opencog/opencog/opencog/cogserver/server/CogServerMain.cc:232
>> (gdb)
>>
>> This is the code the garbage collector invokes to suspend threads while it runs. I've been digging into the collector's code and found several potential interactions that would produce what I've been seeing. Essentially, as the number of objects in the cogserver heap grows, the conservative GC has to search a larger and larger space, until eventually that search takes a single core's entire processing capability and the blocking begins to dominate all execution.
>>
>> You can see this in my notes in the attached observe-text.log file, where I compare GC time with total time: 320 seconds out of 463 for my 5-chapter test file, i.e. 69% of the processing time for only 303 sentences. And this percentage grows with successive processing of the same file. These successive runs create no new atoms, yet they leak 0.130 GB per execution with the (gc) call in observe-text, and 0.400 GB per run of the 5 chapters without the (gc) call.
>>
>> Currently the job has been sitting on a single sentence for 12 hours while the garbage collector runs repeatedly, trying to complete its job. Since I hit this last week, let it run over the weekend, and found it finished when I came back on Monday, I suspect the GC is still churning away under repeated invocation and will eventually get far enough to process the sentence. Completion will then trigger another internal heap allocation, which is always made at 200% of the previous size; this will cause the GC to be invoked much less often for a while, so the remaining sentences will process much faster. But at some heap size the GC stops working, as the processing time it requires dominates a single CPU core.
>>
>> All the above means Guile's current GC won't work for the AtomSpace, not without some significant work. That's the bad news.
>>
>> The good news is that I believe we can implement a fix for our special case, Guile running in a separate process in the CogServer, in a few days. My idea is a bit of a hack, but it will give extremely fast allocations, zero leaks, and at least a 2x speedup over the small-file case where we're getting sub-second processing of lines. I'd not be surprised if we can hit 0.5 seconds in the no-SQL-saving case.
>>
>> I have attached a screenshot of top, where you can see 100% CPU for the cogserver, and terminal windows showing that the cogserver started without errors and is on sentence 590, which took 1049 seconds to reach, for an average per-sentence time of 1.78 seconds at that point. You can also see that there are no atoms in the atomspace (except the default one that gets added to every atomspace).
>>
>> So my plan is: as a first step, I'm going to implement a quick-and-dirty replacement for the GC which just uses malloc, to see whether that gets through the file. If this approach works, it will confirm the above, but it won't be suitable for the long term.
>>
>> Assuming the malloc-based allocations complete in reasonable time, I've got some ideas for altering the GC code to use a separate per-thread heap for new allocations, which we can then dispose of en masse when the thread terminates. Assuming we can get a decent bound on the usage of guile-internal objects during statement execution, this will be much, much faster and won't leak. Everyone wins.
>>
>> Let me know if you have any questions or suggestions.
>>
>> - Curtis
>>
>> On Thu, Jun 1, 2017 at 1:42 AM, Ben Goertzel <[email protected]> wrote:
>>
>>> Thanks a lot Curtis!
>>>
>>> I totally grok the desire to complete an analysis before announcing results... OTOH if we were all sitting in the same office we'd probably be communicating lots of interim results. So generally I'm biased toward fostering communication! ...
>>>
>>> See you Monday...
>>>
>>> ben
>>>
>>> On Wed, May 31, 2017 at 7:58 AM, Curtis Faith <[email protected]> wrote:
>>> > I should have been more explicit and said that removing the store and fetch was just a quick and dirty way of measuring the performance impact of this issue, not a suggested solution to it.
>>> >
>>> > On Wed, May 31, 2017 at 10:13 PM, Curtis Faith <[email protected]> wrote:
>>> >>
>>> >> I am sending this a bit earlier than I normally would, as I am still trying to figure out the Guile garbage collector's contribution to performance and am not quite done. I have also not yet measured a test over significant data for a before-and-after comparison, for example.
>>> >>
>>> >> But since Ben sent Linas an email about this problem, I'll send my work in progress, so Linas has a bit more information to answer Ben's question.
>>> >>
>>> >> So, the following is a partial analysis of the SQL issues I found. I removed the code for fetch-atom and store-atom, and while this is considerably faster, there still seem to be many-second pauses in processing that are not related to the number of words in the sentence in any way I can easily correlate. The Guile garbage collector? This might be due to a memory leak which results in the garbage collector having to search an ever-increasing memory space looking for object-ish references using its conservative scheme.
>>> >>
>>> >> - Curtis
>>> >>
>>> >> -----------------------
>>> >>
>>> >> My 6,000-sentence test file finished over the weekend. It took about 48 hours.
>>> >>
>>> >> For some context, the input corpus, Pride and Prejudice, has approximately:
>>> >>
>>> >> 120 K words
>>> >> 700 K bytes
>>> >> 6 K sentences
>>> >> 2 K paragraphs
>>> >> 25 average words per sentence
>>> >>
>>> >> So it is a fair-sized novel, but only a tiny, tiny fraction of what we'd like to feed the language learning system.
>>> >>
>>> >> In looking at the Postgres statistics for updates on a freshly initialized database, I found:
>>> >>
>>> >> SELECT relname, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del FROM pg_stat_user_tables;
>>> >>
>>> >>   relname   | idx_scan  | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
>>> >> ------------+-----------+---------------+-----------+-----------+-----------
>>> >>  atoms      |  95789119 |      90072278 |   3288742 |         0 |         0
>>> >>  values     |         4 |             0 |         0 |         0 |         0
>>> >>  typecodes  |         0 |             0 |       228 |         0 |         0
>>> >>  valuations | 131581719 |      84520666 |  44519326 |         0 |  41926158
>>> >>  spaces     |   3288742 |       3288742 |         2 |         0 |         0
>>> >> (5 rows)
>>> >>
>>> >> So roughly 90M index tuple fetches for atoms and 85M for valuations while parsing a single file of 6K sentences. We have 3.3M atoms inserted and 90.0M index tuple fetches for the atoms table.
>>> >>
>>> >> In digging into the code, I found we have an explicit delete and insert for every single increment of every atom count maintained by the system. The count-increment code is also recursive over the trees defined by each pair, which in practice means many different atoms: even though we only increment the count for one atom in the tree, the store-atom and fetch-atom calls update the truth values for every atom in the tree that has a truth value. The pair-counting code has this comment:
>>> >>
>>> >> ; ---------------------------------------------------------------------
>>> >> ; update-pair-counts -- count occurrences of word-pairs in a parse.
>>> >> ;
>>> >> ; Note that this might throw an exception...
>>> >> ;
>>> >> ; The structures that get created and incremented are of the form
>>> >> ;
>>> >> ;     EvaluationLink
>>> >> ;         PredicateNode "*-Sentence Word Pair-*"
>>> >> ;         ListLink
>>> >> ;             WordNode "lefty"    -- or whatever words these are.
>>> >> ;             WordNode "righty"
>>> >> ;
>>> >> ;     ExecutionLink
>>> >> ;         SchemaNode "*-Pair Distance-*"
>>> >> ;         ListLink
>>> >> ;             WordNode "lefty"
>>> >> ;             WordNode "righty"
>>> >> ;         NumberNode 3
>>> >>
>>> >> So every pair has 11 atoms.
>>> >>
>>> >> The truth value gets set only at the top level, but the fetch and store end up retrieving and storing the valuations for the entire tree wherever there are truth values.
>>> >>
>>> >> Further, the counting code is storing and fetching the truth values for each unique count value each time it gets incremented. So we'll have (1: an insert for 1), then (2: a delete of the 1 and an insert for 2), then (3: a delete of the 2 and an insert for 3), (4: a delete of the 3 and an insert for 4), etc. And this is done for the EvaluationLink, the ExecutionLink, and the four WordNodes in the above example.
>>> >>
>>> >> I wanted to verify this assumption, so I created a fresh database and turned on statement logging in Postgres. It resulted in the attached log file for a test of a four-word sentence:
>>> >>
>>> >> (observe-text "This is a test.")
>>> >>
>>> >> We got 5419 lines in the log file (not counting the 333 lines of startup creation of types etc.), each of which represents a single SQL statement. There are:
>>> >>
>>> >> 813 BEGIN,
>>> >> 813 COMMIT,
>>> >> 2127 SELECT,
>>> >> 720 DELETE, and
>>> >> 946 INSERT
>>> >>
>>> >> statements for a four-word sentence. (See the attached text log for details.)
>>> >>
>>> >> There are 133 new atoms created and 93 valuations after the call to observe-text is finished.
>>>
>>> --
>>> Ben Goertzel, PhD
>>> http://goertzel.org
>>>
>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the boundary, I am the peak."
>>> -- Alexander Scriabin

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAJzHpFqPfE5YB9-u6gfvrz%3DBSX0fTJ0jgiinUe66kPr1SurPXg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
