I don't know your performance requirements, but I always thought one way to do a distributed atomspace would simply be to have a bunch of independent atomspaces that all share one distributed Cassandra database as the "disk" storage layer.
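For concreteness, here is a toy sketch of that architecture: several independent in-memory "atomspaces" writing through to one shared key-value store. The dict stands in for a distributed Cassandra/Scylla keyspace; all class and method names here are illustrative, not from any actual Atomspace API.

```python
# Toy sketch: independent atomspaces sharing one storage layer.
# The dict stands in for the shared Cassandra/Scylla "disk" layer;
# ToyAtomspace and its methods are hypothetical names.

shared_store = {}  # stand-in for the shared distributed store

class ToyAtomspace:
    def __init__(self, store):
        self.store = store   # the shared backing store
        self.local = {}      # this node's in-RAM cache

    def add(self, key, atom):
        self.local[key] = atom
        self.store[key] = atom      # write-through to shared storage

    def get(self, key):
        if key not in self.local:   # cache miss: fault in from storage
            self.local[key] = self.store[key]
        return self.local[key]

a, b = ToyAtomspace(shared_store), ToyAtomspace(shared_store)
a.add("ConceptNode:cat", {"type": "ConceptNode", "name": "cat"})
# b never saw the atom locally, but finds it via the shared layer:
assert b.get("ConceptNode:cat")["name"] == "cat"
```

In the real design, the consistency and replication questions discussed below would all live inside the shared layer, not in the individual atomspaces.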
Note that I continue to reference Cassandra because it is better known, but if you were going to adopt a third-party datastore wholesale, I do recommend the Scylla C++ implementation of Cassandra, having used it in production for realtime ML systems at moderate scale (6+ nodes in my case, though it is documented to scale to hundreds, as I recall).

On Wed, Jul 29, 2020, 11:59 AM Ben Goertzel <[email protected]> wrote:
> Matt,
>
> I looked at Cassandra some time ago, haven't used it in practice though...
>
> You are pointing it out here as a source of design ideas/inspirations, but I'm also wondering: do you think it would be a strong choice as an ingredient in an OpenCog Hyperon (next-gen OpenCog) distributed Atomspace? We have been looking at Apache Ignite, which serves a different purpose, and of course the two have been integrated as well: https://apacheignite-mix.readme.io/docs/ignite-with-apache-cassandra
>
> It looks like graph databases aren't going to be apropos for the persistent storage component in Hyperon, and key-value stores are probably the right level to be looking at...
>
> I haven't thought through how the various levels of non-ACID consistency in Cassandra might help with a distributed Atomspace: https://blog.yugabyte.com/apache-cassandra-lightweight-transactions-secondary-indexes-tunable-consistency/
>
> ben
>
> On Wed, Jul 29, 2020 at 11:19 AM Matt Chapman <[email protected]> wrote:
> >
> > > Which peers?
> >
> > As determined by a token ring: https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/architecture/archDataDistributeDistribute.html
> >
> > I think you could almost replace "vnode" with "chunk" if you wanted to adopt the Cassandra architecture, although I wouldn't be surprised to see performance problems with a huge number of vnodes, so it might actually need to be a "chunk-hash modulo reasonable number of vnodes".
> >
> > > How do you find them?
> >
> > By calculating the partition token via consistent hash, as Cassandra does with Murmur3. This tells you the authoritative source for the chunk you want. You might also have a local cache of other peers that have had replicas of that chunk, in case any of them are more responsive to you. Cassandra calls this process of finding potential replicas "snitching".
> >
> > > You are thinking Kademlia (as do I, when I think of publishing) or OpenDHT or IPFS.
> >
> > Nope. I've only played with IPFS a bit, but I don't expect it to be performant for the atomspace use case. I'm only vaguely familiar with OpenDHT; it seems worth exploring, but I'm sure you understand it far better than I do.
> >
> > I'm not very familiar with p2p systems like Kademlia, but I suspect they are optimized for consistency & availability over performance, so not the right choice for a distributed atomspace.
> >
> > By this point, it should be clear that I look to Cassandra for how semi-consistent distributed data storage systems should be designed. (Fwiw, my inspiration for distributed messaging systems comes mostly from Apache Kafka.)
> >
> > > Which is great, if all you're doing is publishing small amounts of static, infrequently-changing information. Not so much, if interacting or blasting out millions of updates. Neither system can handle that -- literally -- tried that, been there, done that. They are simply not designed for that.
> >
> > Cassandra is. To be fair, Cassandra is optimized for massive scale, which may involve some trade-offs that are not desirable for present-day atomspace use cases.
> >
> > See also ScyllaDB for a C++ reimplementation of Cassandra.
> >
> > > Now, perhaps using only a hash-driven system, it is possible to overcome these issues. I do not know how to do this. Perhaps someone does -- perhaps there are even published papers ... I admit I did not do a careful literature search.
> >
> > http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
> > http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
> >
> > Matt
> >
> > On Wed, Jul 29, 2020, 9:37 AM Linas Vepstas <[email protected]> wrote:
> >>
> >> On Wed, Jul 29, 2020 at 1:09 AM Matt Chapman <[email protected]> wrote:
> >>>
> >>> > I think it's a mistake to try to think of a distributed atomspace as one super-giant, universe-filling, uniform, undifferentiated blob of storage.
> >>>
> >>> > You don't want broadcast messages going out to the whole universe.
> >>>
> >>> Not sure if you intended to imply it, but the reality of the first statement need not require the second. Hashes of atoms/chunks can be mapped via modulo onto hashes of peer IDs so that messages need only go to one or a few peers.
> >>
> >> Which peers? How do you find them? You are thinking Kademlia (as do I, when I think of publishing) or OpenDHT or IPFS. Which is great, if all you're doing is publishing small amounts of static, infrequently-changing information. Not so much, if interacting or blasting out millions of updates. Neither system can handle that -- literally -- tried that, been there, done that. They are simply not designed for that.
> >>
> >> Now, perhaps using only a hash-driven system, it is possible to overcome these issues. I do not know how to do this. Perhaps someone does -- perhaps there are even published papers ... I admit I did not do a careful literature search.
> >>
> >> But, basically, before we are even out of the gate, we already have a snowball of problems with no obvious solution. Haven't even written any code, and are beset by technical problems. That's not an auspicious beginning.
> >>
> >> If you have something more specific, let me know. Right now, I simply don't know how to do this.
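For concreteness, the token-ring scheme Matt describes can be sketched in a few lines: each peer claims several vnode positions on a hash ring, and a chunk's partition token picks the authoritative peer. Cassandra uses Murmur3; a stdlib hash stands in here, and all names (`TokenRing`, `owner`, the peer IDs) are illustrative, not from Cassandra or any Atomspace API.

```python
# Minimal consistent-hash ring with vnodes (illustrative names only).
import bisect
import hashlib

def token(key: str) -> int:
    """64-bit partition token for a key (Murmur3 stand-in)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class TokenRing:
    def __init__(self, peers, vnodes_per_peer=8):
        # Each peer owns several points ("vnodes") on the ring.
        self.ring = sorted(
            (token(f"{peer}#{i}"), peer)
            for peer in peers
            for i in range(vnodes_per_peer)
        )
        self.tokens = [t for t, _ in self.ring]

    def owner(self, chunk_id: str) -> str:
        """The first vnode at or after the chunk's token owns the chunk."""
        i = bisect.bisect(self.tokens, token(chunk_id)) % len(self.ring)
        return self.ring[i][1]

ring = TokenRing(["peer-a", "peer-b", "peer-c"])
# Every node computes the same owner locally -- no broadcast needed.
assert ring.owner("chunk-42") == ring.owner("chunk-42")
assert ring.owner("chunk-42") in {"peer-a", "peer-b", "peer-c"}
```

This is what makes the lookup broadcast-free: any node with the peer list can compute the owner of any chunk without asking anyone.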
> >>
> >> --linas
> >>>
> >>> Specialization has a cost, in that you need to maintain some central directory or gossip protocol so that peers can learn which other peers are specialized to which purpose.
> >>>
> >>> An ideal general-intelligence network may very well include both a large number of generalist, undifferentiated peers and clusters of highly interconnected specialized peers. If peers are neurons, I think this describes the human nervous system also, no?
> >>>
> >>> To borrow terms from my previous message, generalist peers own many atoms and replicate few, while specialist peers own few or none, but replicate many.
> >>>
> >>> Matt
> >>>
> >>> On Tue, Jul 28, 2020, 10:36 PM Linas Vepstas <[email protected]> wrote:
> >>>>
> >>>> On Tue, Jul 28, 2020 at 11:41 PM Ben Goertzel <[email protected]> wrote:
> >>>>>
> >>>>> Hmm... you are right that OpenCog hypergraphs have natural chunks defined by recursive incoming sets. However, I think these chunks are going to be too small, in most real-life Atomspaces, to serve the purpose of chunking for a distributed Atomspace.
> >>>>>
> >>>>> I.e. it is true that in most cases the recursive incoming set of an Atom should all be in the same chunk. But I think we will probably need to deal with chunks that are larger than the recursive incoming set of a single Atom, in very many cases.
> >>>>
> >>>> I like the abstract of the Ja-be-ja paper; I will read and ponder. It sounds exciting.
> >>>>
> >>>> But ... the properties of a chunk depend on what you want to do with it.
> >>>>
> >>>> For example: if some peer wants to declare a list of everything it holds, then clearly, creating a list of all of its atoms is self-defeating. But if some user wants some specific chunk, well, how does the user ask for that? How does the user know what to ask for?
> >>>> How does the user say "hey, I want that chunk which has these contents"? Should the user say "deliver to me all chunks that contain Atom X"? If the user says this, then how does the peer/server know if it has any chunks with Atom X in it? Does the peer/server keep a giant index of all atoms it has, and what chunks they are in? Is every peer/server obliged to waste some CPU cycles to figure out if it's holding Atom X? This gets yucky, fast.
> >>>>
> >>>> This is where QueryLinks are marvelous: the Query clearly states "this is what I want", and the query is just a single Atom. It can be given an unambiguous, locally-computable (easily-computable; we already do this) 80-bit or 128-bit (or bigger) hash, and that hash can be blasted out to the network (I'm thinking Kademlia, again) in a compact way -- it's not a lot of bytes. The request for the "query chunk" is completely unambiguous, and the user does not have to make any guesses whatsoever about what may be contained in that chunk. Whatever is in there, is in there. This solves the naming problem above.
> >>>>
> >>>>> What happens when the results for that (new) BindLink query are spread among multiple peers on the network in some complex way?
> >>>>
> >>>> I'm going to avoid this question for now, because "it depends" and "not sure" and "I have some ideas".
> >>>>
> >>>> My gut impulse is that the problem splits into two parts: first, find the peers that you want to work with; second, figure out how to work with those peers.
> >>>>
> >>>> The first part needs to be fairly static, where a peer can advertise "hey, this is the kind of data I hold, this is the kind of work I'm willing to perform." Once a group of peers is located, many of the scaling issues go away: groups of peers tend to be small. If they are not, you organize them hierarchically, the way you might organize people, with specialists for certain tasks.
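The "locally-computable hash of a query Atom" idea above can be sketched as a Merkle-style hash over the Atom's structure. The nested-tuple encoding and the use of MD5 to get 128 bits are illustrative choices here, not the Atomspace's actual hashing scheme.

```python
# Sketch: an unambiguous, locally-computable 128-bit content hash
# for an Atom.  Atoms are modeled as nested tuples (Links) and
# strings (Nodes); the encoding is hypothetical, not OpenCog's.
import hashlib

def atom_hash(atom) -> str:
    if isinstance(atom, tuple):
        # A Link: hash its type together with the hashes of its outgoing set.
        parts = " ".join(atom_hash(child) for child in atom[1:])
        text = f"({atom[0]} {parts})"
    else:
        # A Node: hash its printed form directly.
        text = atom
    return hashlib.md5(text.encode("utf-8")).hexdigest()  # 128 bits

query = ("QueryLink",
         ("EvaluationLink", "PredicateNode:likes", "VariableNode:$x"))
h = atom_hash(query)
assert len(h) == 32  # 32 hex digits = 128 bits
```

Any two nodes that hold the same query compute the same hash independently, which is what makes the hash usable as a compact, unambiguous network-wide name for the "query chunk".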
> >>>>
> >>>> I think it's a mistake to try to think of a distributed atomspace as one super-giant, universe-filling, uniform, undifferentiated blob of storage. I think we'll run into all sorts of conceptual difficulties and design problems if you try to do that. If nothing else, it starts smelling like quorum-sensing in bacteria, which is not an efficient way to communicate. You don't want broadcast messages going out to the whole universe. Think instead of atomspaces connecting to one another like dendrites and axons: a limited number, a small number of connections between atomspaces, but point-to-point, sharing only the data that is relevant for that particular peer-group.
> >>>>
> >>>> -- Linas
> >>>>
> >>>> --
> >>>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> >>>> --Peter da Silva
> >>>>
> >>>> --
> >>>> You received this message because you are subscribed to the Google Groups "opencog" group.
> >>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> >>>> To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA35zN4aaSrZ2Dpu4qLUL1bYfjAF_rGiS_xxg2-E-SBqY3Q%40mail.gmail.com.
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> “The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.” -- Jack Kerouac
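Returning to the "natural chunks defined by recursive incoming sets" idea quoted above, a toy computation of that closure over a hypergraph stored as a link-to-outgoing-set map might look like this. The graph, the atom names, and the chunking rule are all illustrative, not from any actual Atomspace.

```python
# Toy hypergraph: each link lists the atoms in its outgoing set.
outgoing = {
    "Eval-likes": ["Pred-likes", "List-cat-fish"],
    "List-cat-fish": ["Concept-cat", "Concept-fish"],
    "Member-cat": ["Concept-cat", "Concept-animal"],
}

# Invert it: atom -> links that contain it (the incoming set).
incoming = {}
for link, outs in outgoing.items():
    for atom in outs:
        incoming.setdefault(atom, []).append(link)

def recursive_incoming_set(atom):
    """All links reachable by repeatedly following incoming sets."""
    seen, stack = set(), [atom]
    while stack:
        a = stack.pop()
        for link in incoming.get(a, []):
            if link not in seen:
                seen.add(link)
                stack.append(link)
    return seen

# "Concept-cat" is pulled into the chunk of every link that
# (transitively) contains it:
assert recursive_incoming_set("Concept-cat") == {
    "Eval-likes", "List-cat-fish", "Member-cat"}
```

Ben's point in the thread is that such closures tend to be small in practice, so a distributed Atomspace would likely need chunks that merge many of these closures together.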
