Hi Ben,

On Thu, Apr 22, 2021 at 5:33 PM Ben Goertzel <[email protected]> wrote:

> > If... well I can imagine that there are scenarios where having a DAS is
> useful .. but we don't have working code for any of those. The problem of
> building a technology like DAS for some non-existent users who might show
> up in the future... well, you might discover that the DAS is built wrong.
> It might have the wrong performance profile. It might be too clunky. It
> might have lots of features you don't need, and be missing features you
> do.   Designing and building things that don't have a current use is a very
> risky business.
>
>
> The bio-Atomspace we are experimenting with now contains only a small
> % of the biomedical knowledge we would like it to, which is because of
> RAM and processing speed limitations in current OpenCog
>
> Recent optimizations help but don't remotely come close to solving the
> problem
>

OK. Well, that's news to me. I try to keep everyone happy, and when there
aren't any comments or complaints, I assume everyone is happy. Do you have
actual examples where you are running out of RAM, and where things are
going too slowly? Or is this just a gut-feel issue, for which you have no
actual data?

I cannot repeat this often enough or strongly enough: the kinds of
optimizations that are performed on software systems are extremely
data-dependent and algorithm-dependent. It is effectively impossible to
perform optimizations without having a specific use case. This is, in
effect, a folk theorem of computer science.

>
> The neural-symbolic grammar learning that Andres Suarez and I
> prototyped last spring, also couldn't viably be done using OpenCog for
> similar reasons (RAM and processing speed limitations).


No one ever complained about RAM or processing speeds, so it's kind of
unfair to just bring this up a year later. I had the impression that the
theory you were developing wasn't working out; I wasn't surprised, but I
never fully understood it.

This spring, I restarted work on https://github.com/opencog/learn -- you
can review the README for the current status. I get good results. It's a big
project. Things go slowly. Not enough time in the day.


> If we could
> complete that work it would be very useful in our current humanoid
> robotics work w/ OpenCog (Awakening.Health)
>

I would like to help, but I suspect that the direction I've been going in
is not likely to match your requirements.  A good solution might not arrive
on the time-scale you want.

> The experimentation on pattern mining from inference histories for
> automated inference control, that Nil was doing a year ago, was
> incredibly slow also due to Atomspace limitations.


Ben, that is also incredibly unfair. Never, ever did you or Nil or anyone
else complain about "atomspace limitations". So you can't just start
blaming it now. If there is an actual performance problem, open a github
issue and describe it. Provide instrumentation data; identify the bottlenecks.

I watched those projects from afar, and ... well, all I can say is "that's
not how I would have done it".  The fact that you had performance problems
is almost surely a statement about your algorithms, and not a statement
about the atomspace.  The atomspace is what it is, and if you use it
incorrectly, you'll get disappointing results. It's not a magic wand. It's
just software, like any other kind of software.


> It is possibly true that for each such case, one could design a
> specialized architecture to support just that case, working around the
> need for a general-purpose DAS in that particular case....
>

You are describing things that sound (to me) like inadequate or
inappropriate algorithms, and then switching the topic to DAS.  You don't
have to use the atomspace -- you could have done the inference mining on
any one of a half-dozen map-reduce platforms out there -- many of them from
the Apache.org people -- and you would not have gotten performance any
better than what the atomspace provides.

These complaints are reminiscent of decades' worth of paper magazine articles
(remember those?), blog posts and marketing campaigns about big data,
scale-up vs. scale-out, data mining, machine learning, no-sql vs. sql vs.
graph databases. There must be ten thousand white-papers and a hundred
thousand pages on this stuff. Nothing that I saw Shujing doing with pattern
mining was any different from what anyone else in the industry does when
they data-mine. This is common, every-day stuff. Everyone in the industry
experiences variations thereon.  If it didn't work out for you, maybe you
didn't have your nose to the grindstone.

The advantage that google enjoys (besides having more money) is that they
can pair PhDs who understand the problem with engineers who understand how
to make the code run fast. Either way, those salaries are huge, and they
never have enough grunts to work on the visionary projects, and serendipity
plays a big role, as does good management.  You can get lucky some of the
time; you can't get lucky all of the time.


> I understand we could proceed by writing
> fully-working-except-for-scalability-issues code for all the above
> applications I've alluded to (and more), and then analyzing all this
> code and its specific scalability issues, and using this analysis to
> drive design of an improved system...
>

I don't think that is what I'm saying.  That's not really how it's done,
when it's done in the industry. You could try that, but it's likely to
fail.  Why?  For starters, I'm not convinced that the algorithms are
correct.  If you have a combinatoric explosion (which I am guessing is what
is happening) then you have to address that first. You have several
options: (a) you can try to mitigate it; (b) you can hunt for alternative
algorithms and data structures; (c) you can redesign it for specialty
hardware (e.g. GPUs); (d) ...
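To make option (a) concrete: in pattern mining, a combinatoric explosion is usually mitigated with apriori-style support pruning -- never extend a pattern whose support is already below threshold, since extensions can only shrink support. Here is a minimal sketch in generic Python (not AtomSpace code; the toy transactions and threshold are made up for illustration):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Apriori-style mining: prune any candidate whose support is
    below threshold, so its (necessarily rarer) extensions are
    never even generated -- that pruning is what tames the blow-up."""
    items = sorted({i for t in transactions for i in t})
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    # Level 1: frequent single items.
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    result = list(level)
    # Grow patterns one item at a time; only extend survivors.
    while level:
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
        result.extend(level)
    return result

transactions = [frozenset(t) for t in
                [("a","b","c"), ("a","b"), ("a","c"), ("b","c"), ("a","b","c")]]
print(sorted(sorted(s) for s in frequent_itemsets(transactions, 3)))
# → [['a'], ['a', 'b'], ['a', 'c'], ['b'], ['b', 'c'], ['c']]
```

The triple {a,b,c} has support 2 and is pruned; without the threshold test, the candidate set doubles with every level.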

No one ever builds "working except for scalability" code, and then "just
scales it". That is not how it's done.

>
> Instead we are indeed aiming to proceed in a faster but in some senses
> more risky way, by creating a design that appears to us capable of
> scalably carrying out applications such as the above


Given that I don't understand the "applications such as those above", I
don't know how to respond. You would have to describe those applications in
engineering terms before anyone could work out how they could be implemented
so as to run efficiently and scalably. Without an actual description of
what it is, it's not a solvable problem; there's just an insufficient
amount of detail.


>
> >> -- a piece of the distributed persistent Atomspace, say in RocksDB
> >>
> >> Then of course the AI process on machine X can query the local
> >> Atomspace specifically as it wishes.   If it wants to query the
> >> persistent backing store, it can query RocksDB and it will get faster
> >> response if the answer happens to be stored in the fragment of RocksDB
> >> that is living on the same machine as it.   There will be some bias
> >> toward the portion of the distributed RocksDB on machine X having
> >> Atoms that relate to the Atoms in the local Atomspace on machine X ...
> >> but this depends on the inner workings of RocksDB.   Or at least
> >> that's how I'd think it would work, this is speculative...
> >
> >
> > This is not quite how it works, but, roughly speaking, I've got
> prototypes that do this today.
>
> What are the main differences btw what I described above and what your
> prototypes do?
>

RocksDB does not do sharding across the network.

If you have different fragments of an atomspace dataset on 10 different
networked machines, and you want to write a pattern match that will run
across all of those machines in parallel and join together the results, I
could write that snippet of code in the proverbial afternoon. It's so
simple, in fact, that it could be written as an example, to add to the set
of examples. (In fact, I think one of the examples already does this, more
or less; it's a mashup of these two demos:
https://github.com/opencog/atomspace/blob/master/examples/atomspace/persist-multi.scm
and
https://github.com/opencog/atomspace/blob/master/examples/atomspace/persist-query.scm
)
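The shape of that afternoon snippet is just scatter-gather: fire the same query at every shard in parallel, wait, and union the partial results. A generic Python sketch of the pattern (the shard list and `query_shard` function are hypothetical stand-ins, not AtomSpace API -- in real life `query_shard` would be a network call to one remote machine):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for "run this pattern match on one remote
# shard and return the matches"; here a shard is just a local set.
def query_shard(shard, pattern):
    return {item for item in shard if pattern(item)}

def distributed_query(shards, pattern):
    """Scatter the same query to every shard in parallel, then
    gather the partial result sets and union them."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: query_shard(s, pattern), shards)
        results = set()
        for p in partials:
            results |= p
    return results

# Ten "machines", each holding one fragment of the dataset 0..99.
shards = [set(range(i * 10, (i + 1) * 10)) for i in range(10)]
evens = distributed_query(shards, lambda n: n % 2 == 0)
print(len(evens))  # → 50
```

The join here is a trivial set union; as the list below spells out, everything hard lives in the operational questions around this loop, not in the loop itself.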

That's not the hard part. The hard part is a ladder of requirements:
* How do you get the shards of data onto those machines? Do you use rsync
to copy files, or do you want to send them via atomspace? If you use rsync,
then where will you keep the script for it?
* Where do you keep the list of the currently-active set of 10 machines?
Do you need a GUI for that? A phone app?
* What do you do if one or more of them hasn't booted, or has crashed?
* Are they password-protected? The atomspace is not password-protected!

There are atomspace issues:
* The simplest solution is to wait until all ten have returned results, and
then join them together.
* Another possibility is to let the results dribble in, and join them as
they arrive.  This is more complex, and requires more sophistication.  The
10-line demo program now becomes a 100- or 200-line program.
* What if one of the machines has crashed during processing? e.g. bad
network card, failed disk, power outage?
* Perhaps you want to load-balance, so that the slowest machine is not
always the bottleneck.  This requires measuring each machine to see if it
is idle or not, and giving it more work if it is idle.  This is
non-trivial.  Most engineers would do this outside of the atomspace, but
you could also do it inside the atomspace if you write custom Atoms for it.
Does your design require custom Atoms for load-balancing?
* Perhaps the dataset is badly sharded, so that one of the machines is
always a bottleneck. This requires not only finding the busiest machine,
but then re-sharding the data. Many databases do this automatically. The
conventional way this is done is to find a sequence of "least
cruel cuts" in the Tononi sense, and move those to other machines: find the
cuts that hurt Phi the least. Talking about Phi is fancy-pants
buzzword-slinging, but all the people who do data-mining have a very
intuitive understanding of Tononi's Phi, and have had that understanding
for many, many decades, because it's key to both software and hardware
optimization. This is easy to say, but finding those cuts is hard to do.
Nothing in opencog today does this automatically.  However, I can imagine
several possible solutions, ranging from really easy ones to really complex
ones, each having pros and cons. Vendors like Oracle have had solutions for
this for decades; they've invested hundreds of man-years into it.
* There's more. I wanted to mention concepts like "EXPLAIN ANALYZE",
"VACUUM ANALYZE" and "query planning", but perhaps some other day. Everyone
gets to solve the query planning problem, including Hyperon. There's no
free lunch.
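On the re-sharding point, the "least cruel cut" idea can be sketched as a greedy pass: measure the cross-machine edge cut of the current placement, then move whichever node shrinks it the most, and repeat. This is generic graph-partitioning code, nothing AtomSpace-specific; the toy graph and machine labels are invented for illustration:

```python
def cut_cost(edges, placement):
    """Number of edges whose endpoints live on different machines."""
    return sum(1 for a, b in edges if placement[a] != placement[b])

def greedy_reshard(edges, placement, machines, passes=3):
    """Repeatedly move the single node whose relocation shrinks the
    cross-machine cut the most -- a crude 'least cruel cut' pass."""
    nodes = list(placement)
    for _ in range(passes):
        best = None  # (gain, node, machine)
        base = cut_cost(edges, placement)
        for n in nodes:
            for m in machines:
                if m == placement[n]:
                    continue
                trial = dict(placement, **{n: m})
                gain = base - cut_cost(edges, trial)
                if best is None or gain > best[0]:
                    best = (gain, n, m)
        if best is None or best[0] <= 0:
            break  # no move helps; stop cutting
        placement[best[1]] = best[2]
    return placement

# Toy dataset: two tight clusters {a,b,c} and {x,y,z}, sharded badly.
edges = [("a","b"), ("b","c"), ("a","c"),
         ("x","y"), ("y","z"), ("x","z"), ("c","x")]
placement = {"a": 0, "b": 0, "c": 1, "x": 1, "y": 0, "z": 1}
print(cut_cost(edges, placement))  # → 4 (before)
placement = greedy_reshard(edges, placement, machines=[0, 1])
print(cut_cost(edges, placement))  # → 1 (after)
```

Real systems face the hard versions of this: the graph is billions of edges, the cost model includes traffic statistics, and the moves must happen online while queries run.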

Then there are the data-design issues and meta-issues:
* Perhaps you are storing data as Atoms that should have been Values.
Values are a lot faster than Atoms, but they buy this performance with a
set of functional trade-offs.
* Perhaps your data should not be kept in the atomspace at all. This
includes audio, video live-streams, text files, medical records, and a
zillion other data types.
* Perhaps you want to run SciPy on text summarizers, or write TensorFlow
algos. There are 1001 software platforms that are tuned for stuff like
that. Use the tool that is appropriate.
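The Atoms-vs-Values point is easy to see in miniature: an Atom is a globally indexed, deduplicated graph object, while a Value is a flat vector hung off a single Atom, indexed nowhere. A schematic Python mock-up of the trade-off (these `Node`/`FloatValue` classes are stand-ins for illustration, not the real AtomSpace API):

```python
class Node:
    """Stand-in for an Atom: every instance is registered in a
    global index, so each one carries bookkeeping overhead."""
    index = {}
    def __init__(self, name):
        self.name = name
        self.values = {}
        Node.index[name] = self  # every atom pays the indexing cost

class FloatValue:
    """Stand-in for a Value: a plain vector, not indexed anywhere."""
    def __init__(self, data):
        self.data = list(data)

samples = [0.01 * i for i in range(1000)]

# Anti-pattern: one atom per data point -- 1000 indexed objects.
per_sample = [Node(f"sample-{i}") for i, s in enumerate(samples)]

# Better: one atom, with the whole series attached as a single value.
series = Node("audio-clip-42")
series.values["samples"] = FloatValue(samples)

print(len(Node.index))  # → 1001: the 1000 anti-pattern atoms, plus 1
```

The trade-off is that the vector contents are opaque to pattern matching; that is the "functional trade-off" the bullet above alludes to.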



> Your help on the bio-Ai project is much appreciated!  However, I would
> note that Mike Duncan doesn't have the facility w/ tuning and fixing
> Atomspace/OpenCog code


Mike is the one who asked for help. I worked with him as the manager in
charge of the project, and not as the technical contact.  90% of the code
was written by Habush, and he is the one I interacted with regularly, and a
little bit with Hedra Seid.  I like Habush. He enjoys exploring brand new
shiny things.


> that Vitaly and some of his St. Petersburg
> colleagues do....


Again, this code is not a part of the atomspace; it's not even a part of
opencog. Vitaly & friends have never looked at it, never touched it.


>  I would venture there are less likely to be
> relatively quick fixes to apparent brick walls that Vitaly etc. run up
> against.  But I'd be happy to be refuted by reality on this -- we will
> be using Original OpenCog in Awakening.Health for quite some time and
> so having it work better and better is definitely valuable to us...
>

None of the agi-bio code is a part of opencog.   That is not the code that
got fixed!



> Hmm, at a high level we did guess a pattern cache was going to be
> useful -- and Senna implemented one some time ago.


The concept of a cache is about as generic as the concept of a loop or an
if-statement.  You are effectively saying that Senna thought of using
if-statements and loops, and implemented a program that used them. That's
just crazy-making!

It's a shame I have to end the email like this, but this is the kind of
miscommunication that we habitually engage in. I find it difficult, at
times. The alternative is to ignore these kinds of comments, but that's not
terribly productive, either.

-- Linas

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAHrUA36_m4U0c1fveXbpjLC8tzcuKpYeqO1PF8R4qHAMifHMZQ%40mail.gmail.com.
