No, I'm using Guile 2.2.2. Indeed I did get that hanging bug earlier, because the install originally went into /usr/local/lib/ instead of /usr/lib/, and I reported an issue on it <https://github.com/opencog/opencog/issues/2741> two weeks ago. All my tests were done with Guile 2.2.2 after I got the configuration sorted out and moved the library into /usr/lib/.
The hang I found wasn't an infinite loop. It actually finished. The CogServer was spending all its time in the Guile GC, and when I came back on Monday, after letting it run all weekend, it was finished.

On Sun, Jun 4, 2017 at 12:13 AM, Linas Vepstas <[email protected]> wrote:

> Which version of guile are you using? You should be using guile-2.2. Sooner or later, you also need to set up bdwgc with the ./configure --enable-large-config or you will get the dreaded `Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS` error.
>
> Re guile inf loop. Yes, there is a bug in guile; I plan to open a bug report "real soon now". I think you are describing this bug below. What I see is that:
> * the guile prompt stops responding
> * cpu spins at exactly 100% cpu usage.
> * GDB shows a very small, very tight spin loop when I single-step; I get this:
>
> single-stepping:
>
> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:975
> 975         fp = SCM_FRAME_DYNAMIC_LINK (fp))
> (gdb) print fp
> $1 = (union scm_vm_stack_element *) 0x7f36183d5e38
> (gdb) step
> 973         for (fp = vp->fp, sp = vp->sp;
> (gdb)
> 979         for (slot = nlocals - 1; sp < fp; sp++, slot--)
> (gdb)
> 1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
> (gdb)
> 1006        sp = SCM_FRAME_PREVIOUS_SP (fp);
> (gdb) print fp
> $2 = (union scm_vm_stack_element *) 0x7f36183d5e38
> (gdb) print sp
> $3 = (union scm_vm_stack_element *) 0x7f36183d5e48
> (gdb) step
> 1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
> (gdb)
> find_slot_map (cache=0x7f361fe85ab0, ip=0x1) at ../../libguile/vm.c:932
> 932         size_t slot = (((scm_t_uintptr) ip) >> 2) % SLOT_MAP_CACHE_SIZE;
> (gdb)
> 935         if (cache->entries[slot].ip == ip)
> (gdb)
> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:1011
> 1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
> (gdb)
> find_slot_map (cache=0x7f361fe85ab0, ip=0x1) at ../../libguile/vm.c:936
> 936         map = cache->entries[slot].map;
> (gdb)
> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:975
> 975         fp = SCM_FRAME_DYNAMIC_LINK (fp))
> (gdb) print fp
> $4 = (union scm_vm_stack_element *) 0x7f36183d5e38
>
> This also happens in older versions, e.g. guile-2.1.8. It's a real bug; I have not had time to tell the guile people about it.
>
> The weird thing is, everything was fine until about 2-3 weeks ago, when I started seeing this bug. I was churning through huge quantities of text before this, and it was not an issue. At all. And now it is. I can't figure out why. As I say, it still hangs with older versions, and I can't figure out what triggered this.
>
> --linas
>
> On Thu, Jun 1, 2017 at 11:09 PM, Curtis Faith <[email protected]> wrote:
>
>> So my intuition about the Guile GC was right after all.
>>
>> I came into the office this morning after leaving a test running last night on the 6,000-sentence file, the one that took two days last week and started me down this path before the AWS work.
>>
>> For the new test last night, I reintroduced the (gc) call in observe-text but did not put back the fetch-atom and store-atom calls.
>>
>> I figured it would either finish quickly last night (if the SQL issue was the only major problem) or run into the same problem I encountered when I ran it last week (if something else, like the GC, was causing the slowdown). Either way I would be able to separate the GC performance issues I suspected from the SQL performance issues I had already noted.
>>
>> When I arrived, the cogserver was still running, using 100% CPU, and processing sentence #590. I suspected the Guile GC, since one process was pegged at 100% on a 6-CPU machine where the cogserver usually runs at 1150%.
>> This matches the profile of one thread blocking all the others, which pointed to the GC. I fired up gdb, attached to the cogserver, and did a backtrace:
>>
>> (gdb) bt
>> #0  0x00007f562938cfb2 in do_sigsuspend (set=0x7f56244ea2c0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
>> #1  __GI___sigsuspend (set=0x7f56244ea2c0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
>> #2  0x00007f56242ce832 in GC_suspend_handler_inner () from /usr/lib/x86_64-linux-gnu/libgc.so.1
>> #3  0x00007f56242ce8a5 in GC_suspend_handler () from /usr/lib/x86_64-linux-gnu/libgc.so.1
>> #4  <signal handler called>
>> #5  0x00007f562941a93d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
>> #6  0x00007f562944bbc4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
>> #7  0x00007f562a2d7fb7 in opencog::CogServer::serverLoop (this=0x19727f0) at /home/inflector/opencog/opencog/opencog/cogserver/server/CogServer.cc:211
>> #8  0x0000000000406b05 in main (argc=3, argv=0x7ffec4633ae8) at /home/inflector/opencog/opencog/opencog/cogserver/server/CogServerMain.cc:232
>> (gdb)
>>
>> This is the code the garbage collector invokes to suspend threads while it runs. I've been digging into the collector's code and found several potential interactions that would produce what I've been seeing. Essentially, as the number of objects in the cogserver heap grows, the conservative GC has to search a larger and larger space, until eventually that search takes a single core's entire processing capability and the blocking begins to dominate all execution.
>>
>> You can see this in my notes in the attached observe-text.log file, where I compare GC time with total time: 320 seconds out of 463 for my 5-chapter test file, i.e. 69% of the processing time for only 303 sentences. And this percentage grows with successive processing of the same file. These successive runs create no new atoms, yet they leak 0.130 GB per execution with the (gc) call in observe-text, and 0.400 GB per run of the 5 chapters without the (gc) call.
>>
>> Currently the job has been sitting on a single sentence for 12 hours while the garbage collector runs repeatedly, trying to complete its job. Since I hit this last week, let it run over the weekend, and found it finished when I came back on Monday, I suspect the GC is still churning away under repeated invocation and will eventually get far enough to process the sentence. Completion will then trigger another internal heap allocation, which is always made at 200% of the previous size; this will cause the GC to be invoked much less often for a while, so the remaining sentences will process much faster. But at some heap size the GC stops working, as the processing time it requires dominates a single CPU core.
>>
>> All the above means Guile's current GC won't work for the AtomSpace, not without some significant work. That's the bad news.
>>
>> The good news is that I believe we can implement a fix for our special case, Guile running in a separate process in the CogServer, in a few days. My idea is a bit of a hack, but it will give extremely fast allocations, zero leaks, and at least a 2x speedup over the small-file case where we're getting sub-second processing of lines. I'd not be surprised if we can hit 0.5 seconds in the no-SQL-saving case.
>>
>> I have attached a screenshot of top, where you can see 100% CPU for the cogserver, and terminal windows showing that the cogserver started without errors and is on sentence 590, which took 1049 seconds to reach, for an average per-sentence time of 1.78 seconds at that point. You can also see that there are no atoms in the atomspace (except the default one that gets added to every atomspace).
>>
>> So my plan is: as a first step, I'm going to implement a quick-and-dirty replacement for the GC which just uses malloc, to see whether that gets through the file. If this approach works, it will confirm the above, but it won't be suitable for the long term.
>>
>> Assuming the malloc-based allocations complete in reasonable time, I've got some ideas for altering the GC code to use a separate per-thread heap for new allocations, which we can then dispose of en masse when the thread terminates. Assuming we can get a decent bound on the usage of guile-internal objects during statement execution, this will be much, much faster and won't leak. Everyone wins.
>>
>> Let me know if you have any questions or suggestions.
>>
>> - Curtis
>>
>> On Thu, Jun 1, 2017 at 1:42 AM, Ben Goertzel <[email protected]> wrote:
>>
>>> Thanks a lot Curtis!
>>>
>>> I totally grok the desire to complete an analysis before announcing results... OTOH if we were all sitting in the same office we'd probably be communicating lots of interim results. So generally I'm biased toward fostering communication! ...
>>>
>>> See you Monday...
>>>
>>> ben
>>>
>>> On Wed, May 31, 2017 at 7:58 AM, Curtis Faith <[email protected]> wrote:
>>> > I should have been more explicit and said that removing the store and fetch was just a quick and dirty way of measuring the performance impact of this issue, not a suggested solution to it.
>>> >
>>> > On Wed, May 31, 2017 at 10:13 PM, Curtis Faith <[email protected]> wrote:
>>> >>
>>> >> I am sending this a bit earlier than I normally would, as I am still trying to figure out the Guile garbage collector's contribution to performance and am not quite done. I have also not yet measured a test over significant data for a before-and-after comparison, for example.
>>> >>
>>> >> But since Ben sent Linas an email about this problem, I'll send my work in progress, so Linas has a bit more information to answer Ben's question.
>>> >>
>>> >> So, the following is a partial analysis of the SQL issues I found. I removed the code for fetch-atom and store-atom, and while this is considerably faster, there still seem to be many-second pauses in processing that are not related to the number of words in the sentence in any way I can easily correlate. The Guile garbage collector? This might be due to a memory leak which results in the garbage collector having to search an ever-increasing memory space looking for object-ish references using its conservative scheme.
>>> >>
>>> >> - Curtis
>>> >>
>>> >> -----------------------
>>> >>
>>> >> My 6,000-sentence test file finished over the weekend. It took about 48 hours.
>>> >>
>>> >> For some context, the input corpus, Pride and Prejudice, has approximately:
>>> >>
>>> >> 120 K words
>>> >> 700 K bytes
>>> >> 6 K sentences
>>> >> 2 K paragraphs
>>> >> 25 average words per sentence
>>> >>
>>> >> So it is a fair-sized novel, but only a tiny, tiny fraction of what we'd like to feed the language learning system.
>>> >>
>>> >> In looking at the Postgres statistics for updates on a freshly initialized database, I found:
>>> >>
>>> >> SELECT relname, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del FROM pg_stat_user_tables;
>>> >>
>>> >>   relname   | idx_scan  | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
>>> >> ------------+-----------+---------------+-----------+-----------+-----------
>>> >>  atoms      |  95789119 |      90072278 |   3288742 |         0 |         0
>>> >>  values     |         4 |             0 |         0 |         0 |         0
>>> >>  typecodes  |         0 |             0 |       228 |         0 |         0
>>> >>  valuations | 131581719 |      84520666 |  44519326 |         0 |  41926158
>>> >>  spaces     |   3288742 |       3288742 |         2 |         0 |         0
>>> >> (5 rows)
>>> >>
>>> >> So roughly 90M index tuple fetches for atoms and 85M for valuations while parsing a single file of 6K sentences. We have 3.3M atoms inserted and 90.0M index tuple fetches for the atoms table.
>>> >>
>>> >> In digging into the code, I found we have an explicit delete and insert for every single increment of every atom count maintained by the system. The count-increment code is also recursive over the trees defined by each pair, which in practice means many different atoms: even though we only increment the count for one atom in the tree, the store-atom and fetch-atom calls update the truth values for every atom in the tree that has a truth value. The pair-counting code has this comment:
>>> >>
>>> >> ; ---------------------------------------------------------------------
>>> >> ; update-pair-counts -- count occurrences of word-pairs in a parse.
>>> >> ;
>>> >> ; Note that this might throw an exception...
>>> >> ;
>>> >> ; The structures that get created and incremented are of the form
>>> >> ;
>>> >> ;     EvaluationLink
>>> >> ;         PredicateNode "*-Sentence Word Pair-*"
>>> >> ;         ListLink
>>> >> ;             WordNode "lefty"    -- or whatever words these are.
>>> >> ;             WordNode "righty"
>>> >> ;
>>> >> ;     ExecutionLink
>>> >> ;         SchemaNode "*-Pair Distance-*"
>>> >> ;         ListLink
>>> >> ;             WordNode "lefty"
>>> >> ;             WordNode "righty"
>>> >> ;         NumberNode 3
>>> >>
>>> >> So every pair has 11 atoms.
>>> >>
>>> >> The truth value gets set only at the top level, but the fetch and store end up retrieving and storing the valuations for the entire tree wherever there are truth values.
>>> >>
>>> >> Further, the counting code is storing and fetching the truth values for each unique count value each time it gets incremented. So we'll have (1: an insert for 1), then (2: a delete of the 1 and an insert for 2), then (3: a delete of the 2 and an insert for 3), (4: a delete of the 3 and an insert for 4), etc. And this is done for the EvaluationLink, the ExecutionLink, and the four WordNodes in the above example.
>>> >>
>>> >> I wanted to verify this assumption, so I created a fresh database and turned on statement logging in Postgres. It resulted in the attached log file for a test of a four-word sentence:
>>> >>
>>> >> (observe-text "This is a test.")
>>> >>
>>> >> We got 5419 lines in the log file (not counting the 333 lines of startup creation of types etc.), each of which represents a single SQL statement. There are:
>>> >>
>>> >> 813 BEGIN,
>>> >> 813 COMMIT,
>>> >> 2127 SELECT,
>>> >> 720 DELETE, and
>>> >> 946 INSERT
>>> >>
>>> >> statements for a four-word sentence. (See the attached text log for details.)
>>> >>
>>> >> There are 133 new atoms created and 93 valuations after the call to observe-text is finished.
>>>
>>> --
>>> Ben Goertzel, PhD
>>> http://goertzel.org
>>>
>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the boundary, I am the peak."
>>> -- Alexander Scriabin

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAJzHpFqPfE5YB9-u6gfvrz%3DBSX0fTJ0jgiinUe66kPr1SurPXg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
