Hmm. In my case, my "infinite loop" did not clear up overnight, while I slept, so I assumed it was infinite.
To those who do threaded programming: there is a situation where there can be a very small window in which to unlock a stuck loop. So, for example, something might be stuck in a loop, spinning, with a lock held 99.99% of the time, but there is a slim chance that some other thread sneaks in during the 0.01% window, fixes whatever the condition is, the loop gets unstuck, and the program eventually finishes without errors. It's possible that that is the case here -- i.e. it eventually finishes, if you wait long enough. It's still a bug. Seems like this is moving to the top of my priority list now. But first, I have to plow through my emails.

--linas

On Sat, Jun 3, 2017 at 11:42 AM, Curtis Faith <[email protected]> wrote:

> No, I'm using Guile 2.2.2. Indeed, I did get that hanging bug earlier, because the install was originally into /usr/local/lib/ instead of /usr/lib/, and I reported an issue on it <https://github.com/opencog/opencog/issues/2741> two weeks ago. All my tests were done with Guile 2.2.2 after I moved the library into /usr/lib/ and got the configuration sorted out.
>
> The hang I found wasn't an infinite loop. It actually finished. The CogServer was spending all its time in the Guile GC, and when I came back on Monday, after letting it run all weekend, it was finished.
>
> On Sun, Jun 4, 2017 at 12:13 AM, Linas Vepstas <[email protected]> wrote:
>
>> Which version of guile are you using? You should be using guile-2.2. Sooner or later, you will also need to build bdwgc with ./configure --enable-large-config, or you will get the dreaded `Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS` error.
>>
>> Re: the guile infinite loop. Yes, there is a bug in guile; I plan to open a bug report "real soon now". I think you are describing this bug below. What I see is that:
>> * the guile prompt stops responding
>> * cpu spins at exactly 100% cpu usage.
>> * GDB shows a very small, very tight spin loop; when I single-step, I get this:
>>
>> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:975
>> 975    fp = SCM_FRAME_DYNAMIC_LINK (fp))
>> (gdb) print fp
>> $1 = (union scm_vm_stack_element *) 0x7f36183d5e38
>> (gdb) step
>> 973    for (fp = vp->fp, sp = vp->sp;
>> (gdb)
>> 979    for (slot = nlocals - 1; sp < fp; sp++, slot--)
>> (gdb)
>> 1011   slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
>> (gdb)
>> 1006   sp = SCM_FRAME_PREVIOUS_SP (fp);
>> (gdb) print fp
>> $2 = (union scm_vm_stack_element *) 0x7f36183d5e38
>> (gdb) print sp
>> $3 = (union scm_vm_stack_element *) 0x7f36183d5e48
>> (gdb) step
>> 1011   slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
>> (gdb)
>> find_slot_map (cache=0x7f361fe85ab0, ip=0x1) at ../../libguile/vm.c:932
>> 932    size_t slot = (((scm_t_uintptr) ip) >> 2) % SLOT_MAP_CACHE_SIZE;
>> (gdb)
>> 935    if (cache->entries[slot].ip == ip)
>> (gdb)
>> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:1011
>> 1011   slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
>> (gdb)
>> find_slot_map (cache=0x7f361fe85ab0, ip=0x1) at ../../libguile/vm.c:936
>> 936    map = cache->entries[slot].map;
>> (gdb)
>> scm_i_vm_mark_stack (vp=0xbdca3120, mark_stack_ptr=0x7f361fe85ef0, mark_stack_limit=0x7f361fe95eb0) at ../../libguile/vm.c:975
>> 975    fp = SCM_FRAME_DYNAMIC_LINK (fp))
>> (gdb) print fp
>> $4 = (union scm_vm_stack_element *) 0x7f36183d5e38
>>
>> This also happens in older versions, e.g. guile-2.1.8. It's a real bug; I have not had time to tell the guile people about it.
>>
>> The weird thing is, everything was fine until about 2-3 weeks ago, when I started seeing this bug. I was churning through huge quantities of text before this, and it was not an issue. At all.
>> And now it is. I can't figure out why. As I say, it still hangs with older versions, and I can't figure out what triggered this.
>>
>> --linas
>>
>> On Thu, Jun 1, 2017 at 11:09 PM, Curtis Faith <[email protected]> wrote:
>>
>>> So my intuition about the Guile GC was right after all.
>>>
>>> I came into the office this morning after leaving a test of the 6,000-sentence file running last night; that file took two days last week and started me on this path, before I began working on AWS.
>>>
>>> For the new test last night, I reintroduced the (gc) call in observe-text but did not put back the fetch-atom and store-atom calls.
>>>
>>> I figured it would either finish quickly overnight (if the SQL issue was the only major problem) or run into the same problem I encountered when I ran it last week (if something else, like the GC, was causing the slowdown). Either way, I would be able to separate the GC performance issues I suspected from the SQL performance issues I had already noted.
>>>
>>> When I arrived, the cogserver was using 100% CPU but was still running, processing sentence #590. I suspected the Guile GC, since one process was pegged at 100% on a 6-CPU machine that usually shows 1150% for the cogserver when running. This matches the profile of one thread blocking all the others, which pointed to the GC. I fired up gdb, attached to the cogserver, and did a backtrace:
>>>
>>> (gdb) bt
>>> #0  0x00007f562938cfb2 in do_sigsuspend (set=0x7f56244ea2c0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
>>> #1  __GI___sigsuspend (set=0x7f56244ea2c0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
>>> #2  0x00007f56242ce832 in GC_suspend_handler_inner () from /usr/lib/x86_64-linux-gnu/libgc.so.1
>>> #3  0x00007f56242ce8a5 in GC_suspend_handler () from /usr/lib/x86_64-linux-gnu/libgc.so.1
>>> #4  <signal handler called>
>>> #5  0x00007f562941a93d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
>>> #6  0x00007f562944bbc4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
>>> #7  0x00007f562a2d7fb7 in opencog::CogServer::serverLoop (this=0x19727f0) at /home/inflector/opencog/opencog/opencog/cogserver/server/CogServer.cc:211
>>> #8  0x0000000000406b05 in main (argc=3, argv=0x7ffec4633ae8) at /home/inflector/opencog/opencog/opencog/cogserver/server/CogServerMain.cc:232
>>> (gdb)
>>>
>>> This is the code that the garbage collector invokes to suspend threads while it runs. I've been digging into the code for the collector and have found several interactions that would result in what I've been seeing. Essentially, as the number of objects in the cogserver heap grows, the conservative GC has to search a larger and larger space, until eventually this search takes a single core's entire processing capability. As the search space grows larger and larger, this blocking begins to dominate all execution.
>>>
>>> You can see this in my notes in the attached observe-text.log file, where I compare GC time with total time: it is 320 seconds out of 463 seconds for my 5-chapter test file. So 69% of the processing time, while processing only 303 sentences.
>>> And this percentage grows with successive processing of the same file. These successive runs create no new atoms, but they show a leak of 0.130 GB per run with the (gc) call in observe-text, and 0.400 GB per run over the 5 chapters without the (gc) call inside observe-text.
>>>
>>> Currently the job has been sitting at 12 hours on one single sentence while the garbage collector runs repeatedly, attempting to complete its job. Since I encountered this last week and just let it run over the weekend, and it was finished when I came back on Monday, I suspect that the GC is still churning away, being repeatedly invoked, and at some point it will finish enough to process the sentence. Further, the completion will trigger another internal allocation, which is always 200% of the previous size; this will cause the GC to be invoked much less often for some period of time, which will make the rest of the sentences process much faster. But at some heap size the GC stops working, as the processing time required dominates a single CPU core.
>>>
>>> All of the above means Guile's current GC won't work for the AtomSpace, not without some significant work. That's the bad news.
>>>
>>> The good news is that I believe we can implement a fix for our special case, Guile running in a separate process in the CogServer, in a few days. My idea is a bit of a hack, but it will result in extremely fast allocations and zero leaks, and it will speed up the code by at least 2x over the small-file case, where we're currently getting sub-second processing of lines. I'd not be surprised if we can hit 0.5 seconds in the no-SQL-saving case.
>>> I have attached a screenshot of top, where you can see 100% CPU for the cogserver, and terminal windows where you can see that there were no errors starting the cogserver and that it is on sentence 590, which took 1049 seconds to reach, with an average per-sentence time at that point of 1.78 seconds. You can also see that there are no atoms in the atomspace (except for the default one that gets added to every atomspace).
>>>
>>> So my plan is: as a first step, I'm going to implement a quick-and-dirty replacement for the GC that just uses malloc, to see if that gets through the file. If this approach works, it will confirm the above, but it won't be suitable for the long term.
>>>
>>> Assuming the malloc-based allocations complete in reasonable time, I've got some ideas for altering the GC code to use a separate per-thread heap for new allocations, which we can then dispose of en masse when the thread terminates. Assuming we can get a decent bound on the usage of guile-internal objects during statement execution, this will be much, much faster and won't leak. Everyone wins.
>>>
>>> Let me know if you have any questions or suggestions.
>>>
>>> - Curtis
>>>
>>> On Thu, Jun 1, 2017 at 1:42 AM, Ben Goertzel <[email protected]> wrote:
>>>
>>>> Thanks a lot Curtis!
>>>>
>>>> I totally grok the desire to complete an analysis before announcing results... OTOH if we were all sitting in the same office we'd probably be communicating lots of interim results. So generally I'm biased toward fostering communication! ...
>>>>
>>>> See you Monday...
>>>>
>>>> ben
>>>>
>>>> On Wed, May 31, 2017 at 7:58 AM, Curtis Faith <[email protected]> wrote:
>>>> > I should have been more explicit and said that removing the store and fetch was just a quick-and-dirty way of measuring the performance impact of this issue, not a suggested solution to it.
>>>> >
>>>> > On Wed, May 31, 2017 at 10:13 PM, Curtis Faith <[email protected]> wrote:
>>>> >>
>>>> >> I am sending this a bit earlier than I normally would, as I am still trying to figure out the contribution of the Guile garbage collector to performance and am not quite done. For example, I have also not yet measured a before-and-after comparison over significant data.
>>>> >>
>>>> >> But since Ben sent Linas an email about this problem, I'll send my work in progress, so Linas has a bit more information with which to answer Ben's question.
>>>> >>
>>>> >> So, the following represents a partial analysis of the SQL issues I found. I removed the code for fetch-atom and store-atom, and while this is considerably faster, there still seem to be many-second pauses in processing that are not related to the number of words in the sentence in any way I can easily correlate. The Guile garbage collector? This might be due to a memory leak that results in the garbage collector having to search an ever-increasing memory space for object-ish references using its conservative scheme.
>>>> >>
>>>> >> - Curtis
>>>> >>
>>>> >> -----------------------
>>>> >>
>>>> >> My 6,000-sentence test file finished over the weekend. It took about 48 hours.
>>>> >>
>>>> >> For some context, the input corpus, Pride and Prejudice, has approximately:
>>>> >>
>>>> >>   120 K words
>>>> >>   700 K bytes
>>>> >>     6 K sentences
>>>> >>     2 K paragraphs
>>>> >>    25 average words per sentence
>>>> >>
>>>> >> So it is a fair-sized novel, but only a tiny, tiny fraction of what we'd like to feed the language-learning system.
>>>> >> In looking at the Postgres statistics for updates on a freshly initialized database, I found:
>>>> >>
>>>> >>   SELECT relname, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del
>>>> >>   FROM pg_stat_user_tables;
>>>> >>
>>>> >>    relname    | idx_scan  | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
>>>> >>   ------------+-----------+---------------+-----------+-----------+-----------
>>>> >>    atoms      |  95789119 |      90072278 |   3288742 |         0 |         0
>>>> >>    values     |         4 |             0 |         0 |         0 |         0
>>>> >>    typecodes  |         0 |             0 |       228 |         0 |         0
>>>> >>    valuations | 131581719 |      84520666 |  44519326 |         0 |  41926158
>>>> >>    spaces     |   3288742 |       3288742 |         2 |         0 |         0
>>>> >>   (5 rows)
>>>> >>
>>>> >> So roughly 90M fetches for atoms and 85M for valuations while parsing a single file of 6K sentences. We have 3.3M atoms inserted and 90.0M index tuple fetches for the atoms table.
>>>> >>
>>>> >> In digging into the code, I found we have an explicit delete and insert for every single increment of every atom count maintained by the system. The count-increment code is also recursive over the trees defined by each pair, which will in practice touch many different atoms, because even though we only increment the count for one atom in the tree, the store-atom and fetch-atom calls update the truth values for every atom in the tree that has a truth value. The pair-counting code has this comment:
>>>> >>
>>>> >>   ; ---------------------------------------------------------------------
>>>> >>   ; update-pair-counts -- count occurrences of word-pairs in a parse.
>>>> >>   ;
>>>> >>   ; Note that this might throw an exception...
>>>> >>   ;
>>>> >>   ; The structures that get created and incremented are of the form
>>>> >>   ;
>>>> >>   ;   EvaluationLink
>>>> >>   ;      PredicateNode "*-Sentence Word Pair-*"
>>>> >>   ;      ListLink
>>>> >>   ;         WordNode "lefty"    -- or whatever words these are.
>>>> >>   ;         WordNode "righty"
>>>> >>   ;
>>>> >>   ;   ExecutionLink
>>>> >>   ;      SchemaNode "*-Pair Distance-*"
>>>> >>   ;      ListLink
>>>> >>   ;         WordNode "lefty"
>>>> >>   ;         WordNode "righty"
>>>> >>   ;      NumberNode 3
>>>> >>
>>>> >> So every pair has 11 atoms.
>>>> >>
>>>> >> The truth value gets set only at the top level, but the fetch and store end up retrieving and storing the valuations for the entire tree wherever there are truth values.
>>>> >>
>>>> >> Further, the counting code is storing and fetching the truth values for each unique count value each time it gets incremented. So we'll have (1: an insert for 1), then (2: a delete of the 1 and an insert for 2), then (3: a delete of the 2 and an insert for 3), then (4: a delete of the 3 and an insert for 4), etc. And this is done for the EvaluationLink, the ExecutionLink, and the four WordNodes in the above example.
>>>> >>
>>>> >> I wanted to verify this assumption, so I created a fresh database and turned on statement logging in Postgres. It resulted in the attached log file for a test of a four-word sentence:
>>>> >>
>>>> >>   (observe-text "This is a test.")
>>>> >>
>>>> >> We got 5419 lines in the log file (not counting the 333 startup lines that create types etc.), each of which represents a single SQL statement. There are:
>>>> >>
>>>> >>    813 BEGIN,
>>>> >>    813 COMMIT,
>>>> >>   2127 SELECT,
>>>> >>    720 DELETE, and
>>>> >>    946 INSERT
>>>> >>
>>>> >> statements for a four-word sentence.
>>>> >> (See the attached text log for details.)
>>>> >>
>>>> >> There are 133 new atoms created and 93 valuations after the call to observe-text is finished.
>>>>
>>>> --
>>>> Ben Goertzel, PhD
>>>> http://goertzel.org
>>>>
>>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the boundary, I am the peak." -- Alexander Scriabin
