Long message...

On Sat, Jun 3, 2017 at 10:07 PM, Curtis Faith <[email protected]>
wrote:

> Linas wrote:
>
>
>> The idea that you are going to use a special pool for guile which you
>> then clear out every so often is just ... a proposal to take a sophisticated
>> GC algorithm and replace it with a truly sophomoric ... ahem, freshman
>> concept of GC. It's a waste of time.
>>
>
> GC is needed when you have long-running processes or threads that can't
> just leak with impunity. It is wholly unnecessary for short-duration tasks
> with moderate memory requirements. In the special case of processing batch
> requests, with web servers like nginx or Apache or the CogServer running
> observe-text, there are not many objects that can't immediately be
> destroyed when the request (or sets of those requests) finishes. That makes
> the overhead of GC unnecessary in these batch-processing cycles. It also
> makes the overhead of finalizing anything unnecessary if there is enough
> RAM to service the requests without any cleanup during a single request's
> processing. You need to release system resources, and that's about it.
>
> A pool-based, clean-up-once-at-the-end approach will make things much
> faster, whether the problem I am seeing ends up being from a bug or not.
>

OK. So let's think this through. The only place where GC is being used is in
guile; GC is not being used in the atomspace itself. So you could
accomplish exactly the same thing by periodically shutting down guile
completely. This would release all that memory, and then you are done.

The problem with this proposal is that pretty much everything runs through
guile.  All the atoms go in through it, and come out through it.  So
shutting down guile and restarting it is tantamount to shutting down the
system, and restarting it.  Which is OK, if you saved all the atoms you
care about to the database.

But I don't see any way of implementing a pool, without fully shutting down
guile; and if one fully shuts down guile, then one doesn't need a pool.
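The pool itself is never spelled out in the thread. As a generic sketch of the clean-up-once-at-the-end idea being debated (this is plain Python standing in for the runtime; the class and method names are hypothetical, not OpenCog or guile code):

```python
class RequestPool:
    """Hold every object a single batch request allocates, then drop
    them all at once when the request finishes -- no per-object
    finalization, no GC tracing during the request."""

    def __init__(self):
        self._objs = []

    def alloc(self, obj):
        # Register an object with the pool and hand it back to the caller.
        self._objs.append(obj)
        return obj

    def release(self):
        # Everything allocated during the request becomes unreachable at once.
        self._objs.clear()

# One batch request: allocate freely, clean up only at the end.
pool = RequestPool()
for i in range(3):
    pool.alloc({"token": i})
print(len(pool._objs))  # objects held until release
pool.release()
print(len(pool._objs))  # pool emptied in one step
```

The catch Linas raises is exactly what this sketch glosses over: in guile, "the pool" would have to be the interpreter's entire heap, so releasing it means restarting guile.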

One alternative is to use python, but python is single-threaded, so this
is a non-starter. A third alternative is haskell, but it's also
garbage-collected.  Can't use C++, because it has no interpreted
command line. (Using C++ is tantamount to shutting everything down,
recompiling, and restarting everything, which is clearly the worst possible
scenario.)  A fifth alternative is to invent a custom vocabulary of words to
control C++ objects from an interactive command line, but this is clearly a
design disaster. It was 1988 when the folly of this was realized, which
caused tcl to be integrated into C apps. The inadequacy of TCL led to the
invention of guile ... and so here we are, full circle.  We could switch to
swig+perl, or to javascript ... but javascript is garbage-collected, and I
think perl is too, not sure. I just don't see any way of implementing what
you are talking about.

>
> Performance is clearly an issue for unsupervised learning
>

Really? Sorry, but in what way? What's the problem?


> and general AI in general. It is also an issue for OpenCog right now since
> getting the data into the AtomSpace in the right form is taking far too
> long right now.
>

Really? In what way? What is the problem?


> No matter what we do performance will always be an issue because of the
> sheer size of the datasets researchers want to work with.
>

Well, if those folks at Intel and AMD weren't so lazy, we'd have great
performance by now.


> That is why Ben first asked me to look into using AWS to spawn parallel
> processes to cut down on the calendar time required to input large corpora.
>

Well, we know Ben is crazy. This is not where the problem lies.  It's easy
to get large corpora pumped through. I can give you a dozen dumps of
datasets so large they won't fit in the RAM of your computer. Do you want
large datasets? 'Cause I've got them.

The problem is that I don't have tools to analyze those datasets. That's
where 95% of my personal bottleneck lies.  Simply crunching a lot of data
is just so totally not at all the hard part.

> I'm seeing 50% to 70% of the time spent in the GC.

Are you using the tool I sent you? Because I am seeing less than 20%.
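The tool Linas mentions isn't shown in the thread. As a generic illustration of how such a measurement works in any runtime that exposes GC hooks, here is a Python sketch (Python's `gc.callbacks` standing in for whatever guile-side instrumentation was actually used):

```python
import gc
import time

gc_time = 0.0   # total seconds spent inside collections
_t0 = None

def _gc_cb(phase, info):
    # CPython calls this at the start and end of every collection pass.
    global gc_time, _t0
    if phase == "start":
        _t0 = time.perf_counter()
    elif phase == "stop" and _t0 is not None:
        gc_time += time.perf_counter() - _t0

gc.callbacks.append(_gc_cb)

wall0 = time.perf_counter()
junk = []
for i in range(200_000):
    junk.append({"id": i})   # GC-tracked allocations
    if i % 1000 == 999:
        junk = []            # drop references so collections have work to do
wall = time.perf_counter() - wall0

gc.callbacks.remove(_gc_cb)
frac = gc_time / wall if wall > 0 else 0.0
print(f"fraction of wall time in GC: {frac:.1%}")
```

Whatever the tool, the point of dispute is the same number: seconds inside collections divided by total wall-clock seconds.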


Sorry for some of the sarcasm. Sure, more performance would always be nice,
but GC time is a complete red herring.  Also, technically, I think GC is
not a solvable problem. The alternative is reference counting, and that is
also a total CPU hog.

I do have some proposals, but first: 1) I have large datasets; 2) creating
large datasets is totally not an issue; 3) creating tools to analyze them
is almost 100% of the issue.

But if you wanted to get atoms into the atomspace faster, like 10x faster
or 20x faster: you could run the link-grammar parser in the same address
space as the atomspace.  Just take what it spits out, convert it into
atoms, and shove the atoms into the atomspace. This would completely bypass
guile, and bypass all GC.  So GC would totally not be an issue in this
case.

To be clear: currently, LG parses text, then bloody java code turns it into
strings, which are sent over a socket to guile, which evaluates the
strings and creates atoms. About 80% or more of this process is the cost
of having guile evaluate the strings that specify atoms.
Eliminate this, and you get an instant 3x or 4x speedup.
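The cost being described, evaluating string representations versus constructing objects directly in-process, can be illustrated generically. This is Python standing in for the LG-to-guile pipeline; the atom-like tuples and function names are hypothetical:

```python
import ast
import timeit

def via_strings(n):
    # The pipeline today: serialize each "atom" to a string, ship it,
    # then have the receiver parse and evaluate it.
    atoms = []
    for i in range(n):
        s = f"('WordNode', 'word-{i}')"
        atoms.append(ast.literal_eval(s))
    return atoms

def direct(n):
    # Same-address-space alternative: build the objects directly,
    # no serialization and no parsing.
    return [("WordNode", f"word-{i}") for i in range(n)]

# Both paths produce identical atoms; only the cost differs.
assert via_strings(100) == direct(100)

t_str = timeit.timeit(lambda: via_strings(1000), number=20)
t_dir = timeit.timeit(lambda: direct(1000), number=20)
print(f"string round-trip: {t_str:.3f}s   direct: {t_dir:.3f}s")
```

The exact speedup depends on the runtime, but the shape of the argument is the same: the string round-trip does all the work of the direct path plus formatting, socket transfer, and evaluation on top.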

Once you did this, you'd discover two other bottlenecks: shoving atoms into
the atomspace is slowwwww. And pushing atoms out to the database is
slowwww.  These are much harder, but more important bottlenecks to overcome.

Re: running LG in the same address space as the atomspace: this has already
been done; the surreal code does this. In a day or two or three you could write
the needed wrapper code to have LG live directly inside of opencog,
generating the correct atoms, thus totally bypassing guile and garbage
collection.  And this would be a very easy way to get a 3x speedup, if
that's really your end-goal.  It's a lot easier than all the other crazy
schemes discussed.

In the very long term, I plan to do this anyway, because I want to apply
the LG algorithms to generic atomspace data, not just to natural language.
However, currently LG is totally focused only on language, and it's too much
work to re-implement it as a generic data parser.  Baby steps, for now.

--linas

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAHrUA37oZMatK-HRFcVF9PA_LKpJGA74erfZGKae6uvoM5HyXg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.