Look,

I don't think we are making forward progress.

1) if you start the cogserver first, then all you have is that crappy
telnet interface to guile, which sucks. I tried to make it nice, but having
to use it still sucks.  So I mostly avoid it.

2) the frame-pointer loop is the *symptom*. You don't know what the bug
is. It's like calling "pain" an illness: it's not, it's the symptom of an
illness.

3) You're obsessing over GC because you can measure it. If instead you had
high-quality measurements of the amount of cpu-time wasted on reference
counting, you'd obsess over that.  Or the amount of time wasted waiting on
a lock to insert an atom into the atomspace, you could obsess about that.
Or the amount of time wasted waiting on the SQL server for an atom.  I
suspect all of those times are worse than GC.

In particular, I suspect that the reason the computation does not run on all
24 cpus is that most of the time is wasted waiting to get the atomspace
lock. But I don't have hard data for that.  Getting hard data for that has
not been a priority.
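Getting that hard data needn't be a big project. Here's a minimal sketch in
Python, for illustration only -- TimedLock and insert_atoms are hypothetical
stand-ins, not actual atomspace code -- of wrapping a contended lock so it
accumulates total time threads spend blocked on it:

```python
import threading
import time

class TimedLock:
    """A lock wrapper that accumulates how long threads spend waiting
    to acquire it -- one way to get hard data on lock contention."""
    def __init__(self):
        self._lock = threading.Lock()
        self._stats_lock = threading.Lock()
        self.wait_time = 0.0  # total seconds all threads spent blocked

    def __enter__(self):
        start = time.monotonic()
        self._lock.acquire()
        waited = time.monotonic() - start
        with self._stats_lock:
            self.wait_time += waited
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False

# Hypothetical stand-in for the atomspace insert path: every worker
# serializes on one lock, so wait_time grows with the thread count.
lock = TimedLock()

def insert_atoms(n):
    for _ in range(n):
        with lock:
            time.sleep(0.001)  # pretend this is the insert under the lock

threads = [threading.Thread(target=insert_atoms, args=(20,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"total time blocked on the lock: {lock.wait_time:.3f}s")
```

The same trick ports directly to a C++ RAII guard around the real atomspace
mutex; the point is just that a single counter turns "I suspect lock
contention" into a number.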

One of my computations gets slower and slower the bigger the atomspace
gets, and it's so disastrous that I've never gotten past about 40% done
with the computation.  The performance drops to a crawl when the atomspace
gets more than about 2M atoms in it.  I have just no clue why.  The amount
of time spent in GC is stable: it's about 1/10th of the time spent in the
computation.  The amount of cpu time consumed is stable, so I don't think
it's stalled on a lock. The amount of sql i/o drops and drops as the amount
of computation drops. I have no clue where the bottleneck is, and it's
driving me crazy.

The point is, there are bottlenecks all over the place, just everywhere.
Maybe they're in guile, hard to tell.

Anyway, there is a way of avoiding guile entirely: perform all of the
parsing in the same address space as the atomspace; that way, guile never
gets involved. Done right, this wouldn't be a bad thing, but the way Ben
talks about it, it makes me nervous that it will be a horrid hack job that
I'll get to fix, instead of doing something fun.  I get tired of taking out
other people's garbage, and I'm just very nervous that we'll generate a lot
of garbage here.

--linas

On Wed, Jun 7, 2017 at 10:08 PM, Curtis Faith <[email protected]>
wrote:

> I wrote:
>
> I don't see why shutting down Guile is tantamount to shutting down the
>>> system and restarting it. Right now, the atom space is created first,
>>
>>
> Linas wrote:
>
>> Ehh? Where?  Right now, guile is started first, then the opencog module
>> is loaded, and this module creates the atomspace.  If you leave guile, you
>> get the bash prompt; there is no atomspace any more; there's no running
>> executable any more, after you leave guile.
>
>
> I wrote:
>
>> then the SchemeEval object is created in CogServer::CogServer.
>>
>>
> Linas wrote:
>
>> That's incorrect. The SchemeEval object is created when you (use-modules
>> (opencog exec)).  I typically create this before running the cogserver.
>
>
> Okay, so you're starting the Guile shell first and using it to launch the
> CogServer. Obviously that makes it much harder to restart Guile without
> losing context. What I said above is correct when one starts the CogServer
> via the linux command line.
>
> Starting the CogServer first is the much easier way to go if one is
> looking to restart Guile itself.
>
> There's no bug in the GC. I don't understand why you keep saying that, or
>> what evidence you have for that.
>
>
> What I said was: "Did you find the specific bug in the GC or Guile's use
> of it that causes the fp to get screwed up?"
>
> I don't know what the bug is. I know it is showing up for me during
> garbage collection, so I've been referring to it as a GC bug for shorthand;
> if you don't like that, I'll stop.
>
> You've found the symptom which causes the hang, as have I: an infinite
> loop in Guile's vm.c code, but for me this loop always happens during
> garbage collection. I don't have any evidence that it is a bug in the bdwgc
> code; indeed it seems much more likely to be in Guile, but it does seem to
> be related to the intersection of Guile and its use of the GC, perhaps
> because the GC is a portion of the code that relies on Guile not having an
> infinite loop while holding a GC lock. I don't know. I do know that Guile
> does lots of tricky things with frames for garbage collection purposes,
> since many if not most of the references to objects the GC will be scanning
> are in the various stacks; indeed there are two fields in the frame element
> union specifically allocated to GC purposes.
>
> So that's why I have been calling this a GC bug. If you prefer, I can
> refer to it as the Frame Pointer Infinite Loop Bug from this point forward.
>
> Still, in my experience, the fact that my hangs are always happening
> during GC means something. Perhaps it is a red herring, and some random
> overwrite that has nothing to do with Guile's interactions with bdwgc. It's
> just a clue, not the answer.
>
> The real question is what is the root cause for the frame pointer of a
> Guile VM stack being incorrect?
>
>
> How many CPUs are on your test machine?
>>
>> 24
>
>
> Sounds like you're counting wall-clock time, not cpu-time.  So that is
>> misleading you.
>
>
> The GC time I measured is CPU time as reported by Guile's (gc-run-time).
> The elapsed time is wall-clock time, sure, I know that. In fact, I'm going
> to argue that it is the most relevant measure.
>
> The perl script you or someone else wrote to send text to the CogServer,
> send-one.pl, measures elapsed clock time. I've been using that since that
> is the number I care about, i.e. how long does it take to parse and put
> this group of sentences into the AtomSpace?
>
> Since *the GC time is time when the other threads are suspended*, there
> is a one-to-one correspondence between changes in time spent in the GC and
> changes to elapsed time as measured by the wall clock.
>
> There is no other OpenCog analysis or AtomSpace work going on while the GC
> is running. If you are measuring 500 seconds of total CPU time and you have
> 24 CPUs, that means you've got 24 potential worker CPUs for the time when
> the GC is not running a stop-the-world collection, and only one while it
> runs the stop-the-world collection.
>
> Consider a simplified case:
>
> If total work takes 100 CPU-seconds and you have 10 CPUs and GC time is
> zero, then you get 100 CPU-seconds / 10 CPUs for 10 seconds of elapsed time.
>
> If GC time to cleanup the objects created by the 10 CPUs is 1 CPU second,
> then you have 1 second GC time *PLUS* 10 seconds for work or 11 seconds
> elapsed time, and the GC is contributing 1 / 11 or 9% to the elapsed time,
> not the 1% you'd get by dividing by the total CPU time 1 / 101.
>
> If GC time is 5 seconds then you have 5 seconds of GC time plus 10 seconds
> for work or 15 seconds elapsed time and the GC is contributing 5 / 15 or
> 33.3% to the elapsed time, not the 5 / 105 or 4.8% time spent in GC which
> seems acceptable.
>
> If GC time is 15 seconds, you get 15 / (15 + 10) or 60% time in GC, but
> the CPU time looks like 15 / 115 or 13% which seems like a high but
> acceptable number.
>
> In your case, the fact that 23 CPUs are standing idle and one is working
> while the GC is running may mean that the GC is only spending 70/500 of the
> CPU time in your measurements, but the percentage of GC contribution to the
> elapsed time is:
>
> GC Time / (GC Time + ((Total CPU time - GC Time) / number of CPUs working)
> + other out-of-process work latencies)
>
> which in the worst case could be something approaching: 70 / (70 + (430 /
> 24)) or 79.6%.
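[That back-of-the-envelope model is easy to check mechanically. A minimal
sketch in Python of the stop-the-world accounting above -- ignoring the
"other out-of-process work latencies" term -- which reproduces all four
percentages:]

```python
def gc_share_of_elapsed(gc_cpu_s, total_cpu_s, n_cpus):
    """Fraction of wall-clock time attributable to a stop-the-world GC.
    The GC runs single-threaded; all remaining CPU time is assumed to be
    spread perfectly across n_cpus worker CPUs."""
    parallel_work = (total_cpu_s - gc_cpu_s) / n_cpus  # wall-clock for the parallel part
    elapsed = gc_cpu_s + parallel_work                 # GC adds serial wall-clock time
    return gc_cpu_s / elapsed

# The simplified cases from above (10 CPUs, 100 CPU-seconds of non-GC work):
print(round(gc_share_of_elapsed(1, 101, 10), 3))   # ~0.091, i.e. 9%
print(round(gc_share_of_elapsed(5, 105, 10), 3))   # ~0.333, i.e. 33.3%
print(round(gc_share_of_elapsed(15, 115, 10), 3))  # 0.6, i.e. 60%

# The worst-case estimate for the 24-CPU measurement (70 of 500 CPU-s in GC):
print(round(gc_share_of_elapsed(70, 500, 24), 3))  # ~0.796, i.e. 79.6%
```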
>
> Keep in mind that I'm not doing any calls to Postgres at all for these
> tests, so my percentages above are probably high for that reason alone.
> Since you do fetch-atom and store-atom calls, it is probable that your
> actual percentage contribution is lower, perhaps much lower, than this
> 79.6%. I don't know how much time any synchronous reads and writes
> contribute to overall elapsed time in your case.
>
> Finally, my point is that if you want to know how much the GC code is
> impacting performance, then GC time / elapsed wall-clock time is the number
> you actually care about since the GC suspends other working threads and any
> work that reduces time spent in stop-the-world garbage collection will
> result in a corresponding one-to-one reduction in elapsed wall-clock time.
>
>
> On Wed, Jun 7, 2017 at 4:58 PM, Linas Vepstas <[email protected]>
> wrote:
>
>> I've got databases so large that they don't fit into RAM, and I've
>> crashed my system several times because of this. I've had to regenerate
>> them over and over, as I've discovered and patched various bugs and/or
>> experienced data loss in several different ways, e.g. the wikipedia vs.
>> gutenberg issue, several thunderstorms that scrambled SQL because I hadn't
>> configured it for safety :-( stuff like that.  And I've done this for
>> multiple languages... So I find it hard to believe that you're stuck on
>> step zero.  All of my issues are elsewhere.
>>
>> I can send you a medium-sized dataset. It's got some issues, still, but
>> it's usable.  It's ready to go: it's got 8M distinct word pairs observed 270M
>> times, and 7M distinct connector sets, observed 260M times.   It's small
>> enough that it loads fairly fast, and you can load all of it, if you wish;
>> it probably fits within 64GB RAM, maybe.
>>
>> --linas
>>
>> On Wed, Jun 7, 2017 at 1:43 AM, Ben Goertzel <[email protected]> wrote:
>>
>>> On Wed, Jun 7, 2017 at 1:37 PM, Linas Vepstas <[email protected]>
>>> wrote:
>>> >
>>> > Well, you've got to add some kind of length limits.  How far apart do
>>> you
>>> > want the words to be?
>>>
>>> That's a parameter one can set, it could be n=10 or n=20 ... a while
>>> ago we passed around some papers indicating the length of an average
>>> dependency link in English, Chinese and other languages...
>>>
>>> >> The above could all be done in C++ perfectly well; it doesn't require
>>> >> Guile because it doesn't require any of the fancy stuff in the current
>>> >> NLP pipeline...
>>> >
>>> >
>>> > What's the point? Why bother? why would you want to do this? what does
>>> it
>>> > accomplish? what does it solve?
>>>
>>> It would get us a way of doing Step 0 of the language learning pipeline
>>> that
>>>
>>> -- works smoothly without hitting weird hard-to-solve Guile bugs and such
>>>
>>> -- grabs possible dependencies in a way that seems less wasteful to me
>>> than generating a bunch of random parses (i.e. just building links btw
>>> words that are reasonably near each other)
>>>
>>> Obviously this is not the interesting part of the language learning
>>> algorithm ... but we need to get this first part working reliably and
>>> reasonably rapidly to get to the other parts, right?   So far Ruiting
>>> hasn't been able to get to the point of experimenting with clustering
>>> and disambiguation algorithms because the Step 0 code hits these Guile
>>> bugs.  But there's no need to have complex code or Guile code for Step
>>> 0, all we need is simpler stuff...
>>>
>>> -- Ben
>>>
>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAHrUA372WY_afZ62EiOSY%2BuptuEDHfOmsEcBGQU6FrcbgfAO3w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.