I wrote:

> I don't see why shutting down Guile is tantamount to shutting down the
> system and restarting it. Right now, the atom space is created first,

Linas wrote:

> Ehh? Where?  Right now, guile is started first, then the opencog module is
> loaded, and this module creates the atomspace.  If you leave guile, you get
> the bash prompt; there is no atomspace any more; theres no running
> executable any more, after you leave guile.


I wrote:

> then the SchemeEval object is created in CogServer::CogServer.

Linas wrote:

> That's incorrect. The SchemeEval object is created when you (use-modules
> (opencog exec)).  I typically create this before running the cogserver.


Okay, so you're starting the Guile shell first and using it to launch the
CogServer. Obviously that makes it much harder to restart Guile without
losing context. What I said above is correct when one starts the CogServer
via the Linux command line.

Starting the CogServer first is the much easier way to go if one is looking
to restart Guile itself.

Linas wrote:

> There's no bug in the GC. I don't understand why you keep saying that, or
> what evidence you have for that.


What I said was: "Did you find the specific bug in the GC or Guile's use of
it that causes the fp to get screwed up?"

I don't know what the bug is. I know it is showing up for me during garbage
collection, so I've been referring to it as a GC bug for shorthand; if you
don't like that, I'll stop.

You've found the symptom that causes the hang, as have I: an infinite loop
in Guile's vm.c code. For me, though, this loop always happens during
garbage collection. I don't have any evidence that the bug is in the bdwgc
code itself; indeed, it seems much more likely to be in Guile. But it does
seem to be related to the intersection of Guile and its use of the GC,
perhaps in a portion of the code that relies on Guile not entering an
infinite loop while holding a GC lock. I don't know. I do know that Guile
does lots of tricky things with frames for garbage-collection purposes,
since many if not most of the references to objects the GC will be scanning
live in the various stacks; indeed, there are two fields in the frame
element union specifically allocated to GC purposes.

So that's why I have been calling this a GC bug. If you prefer, I can refer
to it as the Frame Pointer Infinite Loop Bug from this point forward.

Still, in my experience, the fact that my hangs always happen during GC
means something. Perhaps it is a red herring, and some random overwrite
that has nothing to do with Guile's interactions with bdwgc. It's just a
clue, not the answer.

The real question is what is the root cause for the frame pointer of a
Guile VM stack being incorrect?


I wrote:

> How many CPUs are on your test machine?

Linas wrote:

> 24


Linas wrote:

> Sounds like you're counting wall-clock time, not cpu-time.  So that is
> misleading you.


The GC time I measured is CPU time as reported by Guile's (gc-run-time).
The elapsed time is wall-clock time, sure, I know that. In fact, I'm going
to argue that it is the most relevant measure.

The Perl script you or someone else wrote to send text to the CogServer,
send-one.pl, measures elapsed clock time. I've been using that, since that
is the number I care about: how long does it take to parse and put this
group of sentences into the AtomSpace?

Since *the GC time is time when the other threads are suspended*, there is
a one-to-one correspondence between changes in time spent in the GC and
changes to elapsed time as measured by the wall clock.

There is no other OpenCog analysis or AtomSpace work going on while the GC
is running. If you are measuring 500 seconds of total CPU time and you have
24 CPUs, that means you've got 24 potential worker CPUs while the GC is not
running a stop-the-world collection, and only one while it is.

Consider a simplified case:

If total work takes 100 CPU-seconds and you have 10 CPUs and GC time is
zero, then you get 100 CPU-seconds / 10 CPUs for 10 seconds of elapsed time.

If the GC time to clean up the objects created by the 10 CPUs is 1
CPU-second, then you have 1 second of GC time *PLUS* 10 seconds of work, or
11 seconds elapsed; the GC is contributing 1 / 11, or 9%, to the elapsed
time, not the 1% you'd get by dividing by the total CPU time (1 / 101).

If GC time is 5 seconds, then you have 5 seconds of GC time plus 10 seconds
of work, or 15 seconds elapsed; the GC is contributing 5 / 15, or 33.3%, to
the elapsed time, not the 5 / 105, or 4.8%, which would seem acceptable.

If GC time is 15 seconds, you get 15 / (15 + 10), or 60%, of elapsed time
in GC, but the CPU-time ratio looks like 15 / 115, or 13%, which seems like
a high but acceptable number.
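The three cases above can be checked with a small model (a sketch; the
function names and the assumption of perfectly parallel non-GC work are
mine):

```python
def gc_share_of_elapsed(work_cpu_s, n_cpus, gc_s):
    """GC's share of wall-clock time, assuming the non-GC work is spread
    evenly across n_cpus and the GC is single-threaded stop-the-world."""
    elapsed_s = work_cpu_s / n_cpus + gc_s
    return gc_s / elapsed_s

def gc_share_of_cpu(work_cpu_s, gc_s):
    """The misleading number: GC's share of total CPU time."""
    return gc_s / (work_cpu_s + gc_s)

# 100 CPU-seconds of work on 10 CPUs, as in the three cases above:
for gc_s in (1, 5, 15):
    pct_elapsed = 100 * gc_share_of_elapsed(100, 10, gc_s)
    pct_cpu = 100 * gc_share_of_cpu(100, gc_s)
    print(f"GC {gc_s:2}s: {pct_elapsed:.1f}% of elapsed, {pct_cpu:.1f}% of CPU")
# GC  1s: 9.1% of elapsed, 1.0% of CPU
# GC  5s: 33.3% of elapsed, 4.8% of CPU
# GC 15s: 60.0% of elapsed, 13.0% of CPU
```

The gap between the two columns grows with the CPU count, which is the
whole point of the argument here.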

In your case, the fact that 23 CPUs are standing idle and one is working
while the GC is running may mean that the GC accounts for only 70/500 of
the CPU time in your measurements, but the GC's percentage contribution to
the elapsed time is:

GC Time / (GC Time + ((Total CPU time - GC Time) / number of CPUs working)
+ other out-of-process work latencies)

which in the worst case could be something approaching: 70 / (70 + (430 /
24)) or 79.6%.
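Plugging your numbers into that formula (a sketch; the function name and
the perfect-parallelism assumption are mine, and other_latency_s stands for
the unknown out-of-process latencies):

```python
def gc_share_worst_case(total_cpu_s, gc_s, n_cpus, other_latency_s=0.0):
    # Elapsed time = serial stop-the-world GC time
    #              + non-GC CPU time spread across n_cpus
    #              + any out-of-process latencies (Postgres, etc.).
    elapsed_s = gc_s + (total_cpu_s - gc_s) / n_cpus + other_latency_s
    return gc_s / elapsed_s

print(f"{100 * gc_share_worst_case(500, 70, 24):.1f}%")  # → 79.6%
```

Any nonzero other_latency_s pulls the percentage down from that worst case,
which is why the real number is probably lower.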

Keep in mind that I'm not doing any calls to Postgres at all for these
tests, so my percentages above are probably high for that reason alone.
Since you do fetch-atom and store-atom calls, it is probable that your
actual percentage contribution is lower, perhaps much lower, than this
79.6%. I don't know how much time any synchronous reads and writes
contribute to overall elapsed time in your case.

Finally, my point is that if you want to know how much the GC code is
impacting performance, then GC time / elapsed wall-clock time is the number
you actually care about, since the GC suspends the other working threads,
and any work that reduces time spent in stop-the-world garbage collection
will result in a corresponding one-to-one reduction in elapsed wall-clock
time.


On Wed, Jun 7, 2017 at 4:58 PM, Linas Vepstas <[email protected]>
wrote:

> I've got databases so large that they don't fit into RAM, and I've crashed
> my system several times because of this. I've had to regenerate them over
> an over, as I've discovered and patched various bugs and/or experienced
> data loss in several different ways. e.g. the wikipedia vs. gutenberg
> issue, several thunderstorms that scrambled SQL cause I ahdn't configured
> it for safety :-( stuff like that.  And I've done this for multiple
> languages... So I find it hard to believe that you're stuck on step zero.
> All of my issues are elsewhere.
>
> I can send you a medium-sized dataset. Its got some issues, still, but its
> usable.  Its ready to go, its got 8M distinct word pairs observed 270M
> times, and 7M distinct connector sets, observed 260M times.   It's small
> enough that it loads fairly fast, and you can load all of it, if you wish,
> it probably fits in within 64GB RAM, maybe
>
> --linas
>
> On Wed, Jun 7, 2017 at 1:43 AM, Ben Goertzel <[email protected]> wrote:
>
>> On Wed, Jun 7, 2017 at 1:37 PM, Linas Vepstas <[email protected]>
>> wrote:
>> >
>> > Well, you've got to add some kind of length limits.  How far apart do
>> you
>> > want the words to be?
>>
>> That's a parameter one can set, it could be n=10 or n=20 ... a while
>> ago we passed around some papers indicating the length of an average
>> dependency link in English, Chinese and other languages...
>>
>> >> The above could all be done in C++ perfectly well; it doesn't require
>> >> Guile because it doesn't require any of the fancy stuff in the current
>> >> NLP pipeline...
>> >
>> >
>> > What's the point? Why bother? why would you want to do this? what does
>> it
>> > accomplish? what does it solve?
>>
>> It would get us a way of doing Step 0 of the language learning pipeline
>> that
>>
>> -- works smoothly without hitting weird hard-to-solve Guile bugs and such
>>
>> -- grabs possible dependencies in a way that seems less wasteful to me
>> than generating a bunch of random parses (i.e. just building links btw
>> words that are reasonably near each other)
>>
>> Obviously this is not the interesting part of the language learning
>> algorithm ... but we need to get this first part working reliably and
>> reasonably rapidly to get to the other parts, right?   So far Ruiting
>> hasn't been able to get to the point of experimenting with clustering
>> and disambiguation algorithms because the Step 0 code hits these Guile
>> bugs.  But there's no need to have complex code or Guile code for Step
>> 0, all we need is simpler stuff...
>>
>> -- Ben
>>
>
>
