Hi Ben,

I called these "building blocks" because they make apparent the need for
something you've been calling "agents": a "remembering agent", a
"forgetting agent", a "sharing agent". I'm using these terms in a slightly
different fashion than you do, so let me explain.

Let me start small. So, right now, you can take the cogserver, start it on
a large-RAM machine, and have half-a-dozen other atomspaces connect to it.
They can request Atoms, do some work, push those Atoms back to the
cogserver. This gives you a distributed atomspace, as long as everything
fits in the RAM of the cogserver, and you don't turn the power off.
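To make that concrete, here's a toy model of the fetch / do-work / push
cycle, in Python. None of these class names are the real API; the dicts
just stand in for the atomspace and the network, to show the shape of the
data flow:

```python
# Toy model of the fetch / do-work / push-back cycle. CogServer and Client
# are illustrative stand-ins, not the actual cogserver API.

class CogServer:
    """Stands in for the single large-RAM cogserver holding the atomspace."""
    def __init__(self):
        self.atoms = {}                      # atom name -> dict of Values

    def fetch(self, name):
        return dict(self.atoms.get(name, {}))

    def store(self, name, values):
        self.atoms.setdefault(name, {}).update(values)


class Client:
    """One of the half-dozen remote atomspaces that connect to it."""
    def __init__(self, server):
        self.server = server
        self.local = {}                      # the client's own RAM copy

    def fetch(self, name):
        self.local[name] = self.server.fetch(name)
        return self.local[name]

    def push(self, name):
        self.server.store(name, self.local[name])


server = CogServer()
server.store('(Concept "foo")', {"count": 1})

a = Client(server)
b = Client(server)
vals = a.fetch('(Concept "foo")')
vals["count"] += 1                           # do some work locally...
a.push('(Concept "foo")')                    # ...and push the Atom back

print(b.fetch('(Concept "foo")')["count"])   # → 2
```

Everything here lives in one process; the real thing does the same dance
over the network, which is exactly why the RAM on the central box becomes
the limit.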

The next obvious step is to enable storage-to-disk for the cogserver.  This
is where the first design difficulties show up.  If a cogserver is running
out of RAM, it should save some Atoms to disk, and clear them out of RAM.
Which ones?  Answer: write a "remembering agent" that implements some
policy for doing this.  There is no need to modify the cogserver itself, or
any of the clients, to create this agent: so it's a nice, modular design.
That's why I called the existing pieces "building blocks": with this third
piece, this "remembering agent", one gets a truly functional small
distributed system.
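To show how small such an agent can be, here is a sketch of one possible
remembering-agent policy -- plain least-recently-used eviction -- with
dicts standing in for the atomspace and the disk. All of the names here
are hypothetical; this is the policy shape, not an implementation:

```python
from collections import OrderedDict

class RememberingAgent:
    """Evicts the least-recently-used atoms to 'disk' when RAM is over
    budget. A policy sketch only; a real agent would sit between the
    atomspace and a storage backend."""
    def __init__(self, ram_budget):
        self.ram = OrderedDict()    # atom -> values, kept in access order
        self.disk = {}              # stands in for the storage backend
        self.ram_budget = ram_budget

    def access(self, atom, values=None):
        if atom in self.ram:
            self.ram.move_to_end(atom)            # mark as recently used
        elif atom in self.disk:
            self.ram[atom] = self.disk.pop(atom)  # fault it back into RAM
        else:
            self.ram[atom] = values or {}
        self._evict()
        return self.ram[atom]

    def _evict(self):
        while len(self.ram) > self.ram_budget:
            atom, vals = self.ram.popitem(last=False)  # oldest first
            self.disk[atom] = vals                     # save, then free RAM

agent = RememberingAgent(ram_budget=2)
agent.access("a"); agent.access("b"); agent.access("a"); agent.access("c")
print(sorted(agent.ram))    # → ['a', 'c']  ("b" was least recently used)
print(sorted(agent.disk))   # → ['b']
```

The point is the modularity: nothing in the "cogserver" or the "clients"
had to change; the policy lives entirely in this third piece.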

If things don't fit on one disk, or if there are hundreds of clients
instead of dozens, then one needs a "sharing agent" to implement some
policy for sharing portions of an atomspace across multiple cogservers.
Again, this is an orthogonal block of code, and one can imagine having
different kinds of agents for implementing different kinds of policies for
doing this.
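As a sketch of what one such sharing policy could look like: hash each
Atom's name to pick its home cogserver, so every client computes the same
placement with no central directory. The server URLs below are made up:

```python
import hashlib

def home_server(atom_name, servers):
    """A naive sharing policy: a stable hash of the atom's name picks the
    cogserver it lives on. Deterministic, so no lookup table is needed."""
    h = int(hashlib.sha256(atom_name.encode("utf-8")).hexdigest(), 16)
    return servers[h % len(servers)]

servers = ["cog://host-a", "cog://host-b", "cog://host-c"]
print(home_server('(Concept "foo")', servers))

# Every client gets the same answer for the same atom:
assert home_server('(Concept "foo")', servers) == \
       home_server('(Concept "foo")', servers)
```

Note the catch: plain mod-N hashing reshuffles nearly every atom when a
server is added or removed; a serious sharing agent would want consistent
hashing, and a different agent could implement a different placement
policy entirely -- which is exactly why this deserves to be its own
orthogonal block of code.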

OK -- so that's the easy part -- the grand concept of modular design. Two
modules are done: the module for "save to disk" (aka atomspace-rocks) and
the one for "communicate-over-network" (aka atomspace-cog). Xabush is also
working on a different kind of communicate-over-network module, and that's
OK, too.

Let's now consider the simplest agent - the "remembering agent" that
sometimes moves atoms from RAM to disk, and then frees up RAM.  How should
that work? Well, it's surprisingly ... tricky.  One could stick a timestamp
on each Atom (a timestamp Value) and save the oldest ones to disk. But
this eats up RAM, to store the timestamp, and then eats more RAM to keep a
sorted list of the oldest ones. If we don't keep a sorted list, then we
have to
search the atomspace for old ones, and that eats CPU.  Yuck and yuck.

Every time a client asks for an Atom, we have to update the timestamp
(like the access timestamp on a unix file).  Unix files keep three
timestamps to aid in decision-making - created, modified and accessed.
That works
because most files are huge compared to the size of the timestamps. For us,
the Atoms are the same size as the timestamps, so we have to be careful not
to mandate that every atom must have some meta-data.
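One classic way out, borrowed from OS page replacement, is the CLOCK
(second-chance) algorithm: a single reference bit per atom instead of a
timestamp Value plus a sorted list. Here's a sketch -- not anything the
atomspace actually does, just the textbook technique:

```python
class ClockEvictor:
    """Approximate LRU with one reference bit per atom (the classic CLOCK
    algorithm), instead of a timestamp plus a sorted list."""
    def __init__(self):
        self.atoms = []     # circular list of [name, ref_bit]
        self.hand = 0       # the clock hand

    def touch(self, name):
        for entry in self.atoms:
            if entry[0] == name:
                entry[1] = 1          # accessed: give it a second chance
                return
        self.atoms.append([name, 1])

    def evict(self):
        while True:
            entry = self.atoms[self.hand]
            if entry[1]:
                entry[1] = 0          # spare it once, clear the bit
                self.hand = (self.hand + 1) % len(self.atoms)
            else:
                victim = self.atoms.pop(self.hand)[0]
                self.hand %= max(len(self.atoms), 1)
                return victim

ev = ClockEvictor()
for name in ["a", "b", "c"]:
    ev.touch(name)
first = ev.evict()   # all bits set; the sweep clears them, then evicts
ev.touch("b")        # "b" gets accessed again
second = ev.evict()  # "b" is spared by its reference bit; "c" goes
print(first, second)  # → a c
```

Touching an atom costs one bit-write, eviction is an amortized sweep, and
the metadata is a single bit per atom -- a reasonable trade when the Atoms
are about the same size as the bookkeeping.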

There's another problem. Suppose some client asks for the incoming set of
some Atom. Well, is that in RAM already, or is it on disk, and needs to be
fetched? The worst-case scenario is to assume it's not, and always re-fetch
from disk. But this hurts performance. (what's the point of RAM, if we're
always going to disk???) Can one be more clever? How?
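One conceivable answer is a per-atom completeness flag: record whether the
full incoming set has already been pulled into RAM, and go to disk at most
once. A sketch, with a dict standing in for the disk:

```python
class IncomingSetCache:
    """Hypothetical scheme: a per-atom flag records whether the complete
    incoming set is already in RAM, so disk is hit at most once."""
    def __init__(self, disk):
        self.disk = disk          # atom -> incoming set; stands in for storage
        self.ram = {}
        self.complete = set()     # atoms whose full incoming set is in RAM
        self.disk_reads = 0

    def incoming_set(self, atom):
        if atom not in self.complete:
            self.ram[atom] = set(self.disk.get(atom, set()))
            self.complete.add(atom)
            self.disk_reads += 1          # the one and only disk fetch
        return self.ram[atom]

disk = {'(Concept "foo")': {"ListLink-1", "EvaluationLink-2"}}
cache = IncomingSetCache(disk)
cache.incoming_set('(Concept "foo")')
cache.incoming_set('(Concept "foo")')   # second call is served from RAM
print(cache.disk_reads)   # → 1
```

The flag is the easy half; the hard half is invalidating it when some
other client adds a new link to that atom on a different machine.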

There's a third problem: "vital" data vs "re-creatable" data. For example,
the genomics dataset itself is "vital", in that if you erase any of it,
that's permanent data loss.  As it is being used, the MOZI genomics
code-base
performs searches, and places the search results into the atomspace, as a
cache, to avoid re-searching next time. These search results are
"re-creatable".   Should re-creatable data be saved to disk? Sometimes?
Always? Never? If one has a dozen Values attached to some Atom, how can you
tell which of these Values are "vital", and which are "recreatable"?
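One could imagine tagging each Value with a "vital" flag at creation time,
so the remembering agent knows what must be persisted and what can be
dropped and re-searched. A purely hypothetical sketch -- nothing in the
current Value API carries such a flag:

```python
# Hypothetical tagging scheme: each Value carries a "vital" flag, and the
# remembering agent persists only the vital ones, dropping the caches.

from dataclasses import dataclass

@dataclass
class TaggedValue:
    key: str
    data: object
    vital: bool    # True: permanent loss if dropped; False: re-creatable

def values_to_persist(values):
    """Policy sketch: save vital values, discard re-creatable caches."""
    return [v for v in values if v.vital]

values = [
    TaggedValue("gene-annotation", "BRCA1 ...", vital=True),
    TaggedValue("cached-search-result", [1, 2, 3], vital=False),
]
kept = values_to_persist(values)
print([v.key for v in kept])   # → ['gene-annotation']
```

Of course this dodges the real question: who sets the flag? The MOZI
search code would have to mark its own results as re-creatable, which is
exactly the kind of convention an agent framework would need to establish.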

So -- I've sketched three different problems that even the very simplest
agent must solve to make even the simplest distributed system.   The
obvious solution is not very good, the good solution is not obvious.

The ideas from ECAN and attention values can help with some of these
problems, but they're not enough in themselves.  More subtle approaches,
ones that actually take into account RAM, CPU and network usage, are
needed.  The
"good news" is that these are "classic OS problems" -- there is a very long
history of these kinds of problems, as anyone who writes an operating
system and/or a database and/or a cloud infrastructure has to solve exactly
these kinds of problems. It's a very active area of research - witness the
clash between the "Microsoft Open Service Mesh" and the "Google Istio"
service mesh: these are both "agent" systems solving exactly the kinds of
problems I'm talking about, and they are light-years ahead of the
AtomSpace in sophistication, because billions of dollars are at stake in
getting these "agents" to work well.

Well, that's it. A few random end-notes:
 -- an unrelated problem is that the existing opencog "agents"
infrastructure is horrible, and needs to be consigned to oblivion.
 -- I'm planning on creating an "atomspace-agents" repo real soon now, as a
research-area/dumping-ground for the kind of agents I describe above.


-- Linas

On Tue, Aug 11, 2020 at 12:41 AM Ben Goertzel <[email protected]> wrote:

> We want a large Atomspace, parts of which are in RAM on various
> machines, parts of which are in persistent storage, and the ability to
> run a variety of queries and processes across this whole Atomspace.  I
> posted something about the "distributed PLN inference" use-case on
> this list not long ago.
>
> The current "distributed Atomspace" functionality is cool but it
> doesn't do this yet.  It could be the foundation for a system doing
> the above, but it might also hit some serious problems.   Matt is
> pointing out how Cassandra could potentially help work around some of
> these problems with its adjustable levels of consistency.
>
> Coordinating a network of distributed sub-Atomspace via a postgres or
> RocksDB backing store in a hub-and-spokes architecture seems like it's
> not going to do what we need ultimately...
>
> The document Matt Chapman linked above in this thread is the result of
> a lot of thought by a number of us, and I think explains the above
> points much more thoroughly than I could do in this brief email (plus
> a bunch of other points I didn't get to in this email)
>
> ben
>
> On Mon, Aug 10, 2020 at 11:57 AM Matt Chapman <[email protected]>
> wrote:
> >>
> >>
> >> >> Does it meet the 7 business requirements in Ben's document:
> https://docs.google.com/document/d/1n0xM5d3C_Va4ti9A6sgqK_RV6zXi_xFqZ2ppQ5koqco/edit
> ?
> >
> >
> > > I have no clue.  I've never seen this document before.   It's only the
> 41st document on this topic, and I'm suffering from reader-fatigue. Care to
> summarize what it says?
> >
> > Provide effective management of AtomSpaces that are too big to fit in
> RAM of any one machine that is available.
> >
> > Decrease the overall processing time required to carry out AI operations
> to reduce cost per AI operation.
> >
> > Decrease memory footprint providing better overall throughput in
> comparing with current implementation to reduce cost per AI operation.
> >
> > Provide ability to use AtomSpace in the manner of hierarchical cache
> structure. In other words, provide a way to look for a specific Atom
> locally before start searching it among other components and fetching
> remotely.
> >
> > Provide ability to request for Atom(s) based on a given Atom's property.
> >
> > Provide ability to request for subgraphs based on patterns.
> >
> > To isolate application layer from source code modification keeping the
> AtomSpace API as-is or with minor changes.
> >
> >
> > The idea I'm suggesting, which I readily admit is worth 1/1000th of the
> effort required to implement it, is that a Cassandra-like architecture
> provides a very good solution for requirements number 4 & 5, and possibly a
> foundation for #6. It also provides #1 for some definition of "effective,"
> arguably better than any centralized architecture, for some definition of
> "better." :-) It may very likely fail at #2 and #3 compared to current
> alternatives; we won't know until someone builds & benchmarks it, and
> that's 1000x more effort...
> >
> > If you think that those requirements are already adequately served by
> existing solutions, then I will stop adding noise to the conversation.
> Otherwise, I'm happy to share more of my experiences if it might be helpful
> in formulating an approach.
> >
> > [Aside Req. 1 here is in fundamental conflict with 2 & 3; usually "we"
> accept a local performance penalty in exchange for distributed &
> decentralized scalability. But Cassandra's Tunable Consistency model is the
> only way I know to expose this trade-off to the user on a per-query basis,
> which seems quite powerful to me, for the Atomspace use case. The value of
> Tunable Consistency (relative to its cost to implement) may be the thing
> I've failed to convince you of, in which case, I certainly trust your
> opinion more than mine.]
> >
> > All the Best,
> >
> > Matt
> >
> > --
> > Please interpret brevity as me valuing your time, and not as any
> negative intention.
> >
> >
> > On Thu, Aug 6, 2020 at 11:18 AM Linas Vepstas <[email protected]>
> wrote:
> >>
> >>
> >>
> >> On Thu, Aug 6, 2020 at 11:30 AM Matt Chapman <[email protected]>
> wrote:
> >>>
> >>>
> >>> I've been hearing people talk about the need for distributed atomspace
> on and off for 8+ years,
> >>
> >>
> >> Mee too. This was a head-scratcher, since we had a distributed
> atomspace. So I was never sure why they talked about it.
> >>
> >>>
> >>> and I've never seen an answer along the lines of "you can already have
> a cluster, here's the documentation on how to set it up."
> >>
> >>
> >> Here's the tutorial for it:
> https://github.com/opencog/atomspace/blob/master/examples/atomspace/distributed-sql.scm
> >>
> >> I changed the name of the tutorial 5 days ago, because we now have not
> one, not two, but four different distributed atomspace solutions (of which
> two don't scale well)
> >>
> >> The instructions to set up each of the four are here:
> >>
> >> The oldest one, which is SQL-based:
> >> https://github.com/opencog/atomspace/tree/master/opencog/persist/sql
> >>
> >> The newest one, which is cogserver-based, and my current favorite:
> >> https://github.com/opencog/atomspace-cog
> >>
> >> The IPFS one, which is the one I love to hate:
> >> https://github.com/opencog/atomspace-ipfs
> >>
> >> The DHT one, which I hope to revive maybe if we get a good chunking
> solution:
> >> https://github.com/opencog/atomspace-dht
> >>
> >>>
> >>>
> >>> Does it meet the 7 business requirements in Ben's document:
> https://docs.google.com/document/d/1n0xM5d3C_Va4ti9A6sgqK_RV6zXi_xFqZ2ppQ5koqco/edit
> ?
> >>
> >>
> >> I have no clue.  I've never seen this document before.   It's only the
> 41st document on this topic, and I'm suffering from reader-fatigue. Care to
> summarize what it says?
> >>
> >> Performance: did anyone run any of the benchmarks on any of the
> distributed AtomSpaces that we currently have?  We *do* have benchmarks for
> them. They're in https://github.com/opencog/benchmark/
> >>
> >> -- Linas
> >>
> >> --
> >> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> >>         --Peter da Silva
> >>
> >> --
> >> You received this message because you are subscribed to the Google
> Groups "opencog" group.
> >> To unsubscribe from this group and stop receiving emails from it, send
> an email to [email protected].
> >> To view this discussion on the web visit
> https://groups.google.com/d/msgid/opencog/CAHrUA35PqjOXv8uu8QnnAYDx%3D3KDSgoX6w-cRtcnCFLz%3DZKYPw%40mail.gmail.com
> .
> >
>
>
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> “The only people for me are the mad ones, the ones who are mad to
> live, mad to talk, mad to be saved, desirous of everything at the same
> time, the ones who never yawn or say a commonplace thing, but burn,
> burn, burn like fabulous yellow roman candles exploding like spiders
> across the stars.” -- Jack Kerouac
>
>


-- 
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva
