> I think it's a mistake to try to think of a distributed atomspace as one 
super-giant, universe-filling uniform, undifferentiated blob of storage.

It is not clear to me why this is a mistake; it obviously has its use 
cases. When I think of a distributed atomspace, I think of multiple 
atomspaces that are partitioned & replicated over multiple nodes (the first 
case with multiple-atomspace support in #2138 
<https://github.com/opencog/atomspace/issues/2138>). To give an example, 
say we want to store human genomic variant data for both reference builds, 
hg19 and hg38 (this link 
<https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19> 
explains what the genome reference builds are). The variant data for each 
reference will go into two separate atomspaces. But since these atomspaces 
need to store 100GB+ of data, they need to be distributed over multiple 
nodes, each node holding portions of each atomspace. A local client will 
then specify which atomspace it wants to query by id (something like 
"hg38") and send a query (you can see elements of this design in the 
atomspace-rpc code). From the client's perspective, it is as if it were 
querying a single large atomspace, but the query engine takes care of 
coming up with an execution plan (optimizing the query if it has to), 
gathering the results from the partitions, and returning them to the 
client. Also, when a user wants to get variant data from the GRCh38 
atomspace, only the partitions of GRCh38 should be searched/queried.
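To make the routing idea concrete, here is a minimal sketch (names, node ids, and the partition count are my own illustration, not the atomspace-rpc API): the client names an atomspace by id, and the coordinator maps an atom key to exactly one partition of that atomspace, so a GRCh38 lookup never touches hg19's nodes.

```python
import hashlib

NUM_PARTITIONS = 4

# Which nodes hold partitions of which atomspace (illustrative only).
PARTITION_MAP = {
    "hg19": ["node-a", "node-b", "node-c", "node-d"],
    "hg38": ["node-e", "node-f", "node-g", "node-h"],
}

def partition_for(atom_key: str) -> int:
    """Map an atom's key to a partition number via a stable hash."""
    digest = hashlib.sha1(atom_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def route_query(atomspace_id: str, atom_key: str) -> str:
    """Return the one node that should answer a lookup for this atom,
    consulting only the named atomspace's partitions."""
    nodes = PARTITION_MAP[atomspace_id]
    return nodes[partition_for(atom_key)]
```

A real query engine would of course plan multi-atom queries across several partitions, but the point is the same: the atomspace id scopes the search, and the hash makes the routing deterministic.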

> I think we'll run into all sorts of conceptual difficulties and design 
problems if you try to do that. If nothing else, it starts smelling like 
quorum-sensing in bacteria. Which is not an efficient way to communicate. 
You don't want broadcast messages going out to the whole universe.

I suggest you look into the design docs of Nebula Graph, which is a 
strongly typed distributed graph DB. I believe they address the issues you 
mentioned above, and it should be possible to implement something similar 
for the first version of the distributed AtomSpace. Here are the links:

[Overview] - 
https://docs.nebula-graph.io/manual-EN/1.overview/3.design-and-architecture/1.design-and-architecture/

[Storage Design] - 
https://docs.nebula-graph.io/manual-EN/1.overview/3.design-and-architecture/2.storage-design/
- part of this is currently implemented through the Postgres backend, as 
demonstrated in this example 
<https://github.com/opencog/atomspace/blob/master/examples/atomspace/distributed.scm>

[Query Engine] - 
https://docs.nebula-graph.io/manual-EN/1.overview/3.design-and-architecture/3.query-engine/
- esp. interesting is how they implement access control through sessions, 
which partly relates to #1855 
<https://github.com/opencog/atomspace/issues/1855>

They implement sharding somewhat similarly to what you described above, 
using edge-cut: a destination vertex and all its incoming edges are stored 
in the same partition, and a source vertex and its outgoing edges in the 
same partition. They use multi-Raft groups ("Multi-Raft only means we 
manage multiple Raft consensus groups on one node") to achieve consistency 
across partitions for multiple databases. This is contrary to what you 
suggested, in that each node doesn't broadcast its changes; only the 
elected leader broadcasts changes (i.e. sends log requests), and the rest 
of the nodes update their partitions accordingly. Of course, a new leader 
can be elected if the current leader fails or its term ends. The above 
design also solves what you noted as the "unsolved part" in #2138 
<https://github.com/opencog/atomspace/issues/2138>
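The edge-cut idea is easy to sketch (a toy illustration, not Nebula's actual API or storage format): every edge is written twice, once keyed by its source vertex and once keyed by its destination, so each partition can answer both "outgoing edges of v" and "incoming edges of v" without touching any other partition.

```python
import zlib

NUM_PARTITIONS = 3

def part(vertex_id: str) -> int:
    """Stable hash of a vertex id -> partition number."""
    return zlib.crc32(vertex_id.encode()) % NUM_PARTITIONS

# Each partition keeps two local indexes, so both directions of a
# traversal are answered locally.
partitions = [{"out": {}, "in": {}} for _ in range(NUM_PARTITIONS)]

def insert_edge(src: str, dst: str, label: str) -> None:
    """Write the edge twice: with the source's partition (as an out-edge)
    and with the destination's partition (as an in-edge)."""
    partitions[part(src)]["out"].setdefault(src, []).append((label, dst))
    partitions[part(dst)]["in"].setdefault(dst, []).append((label, src))

def outgoing(v: str):
    return partitions[part(v)]["out"].get(v, [])

def incoming(v: str):
    return partitions[part(v)]["in"].get(v, [])
```

In Nebula each such partition is then replicated by its own Raft group, so the double write is made consistent by the group's leader rather than by broadcasting to every node.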

Anyway, I think implementing something similar to Nebula as an initial 
version has immediate benefits for projects that use the atomspace to store 
and process out-of-RAM data, such as genomic data.

On Wednesday, July 29, 2020 at 9:09:08 AM UTC+3 Matt Chapman wrote:

> >I think it's a mistake to try to think of a distributed atomspace as one 
> super-giant, universe-filling uniform, undifferentiated blob of storage. 
>
> > You don't want broadcast messages going out to the whole universe.
>
> Not sure if you intended to imply it, but the reality of the first 
> statement need not require the second. Hashes of atoms/chunks can be 
> mapped via modulo onto hashes of peer IDs so that messages need only go to 
> one or a few peers.
>
> Specialization has a cost, in that you need to maintain some central 
> directory or gossip protocol so that peers can learn which other peers are 
> specialized to which purpose.
>
> An ideal general intelligence network may very well include both a large 
> number of generalist, undifferentiated peers and clusters of highly 
> interconnected specialized peers. If peers are neurons, I think this 
> describes the human nervous system also, no?
>
> To borrow terms from my previous message: generalist peers own many atoms 
> and replicate few, while specialist peers own few or none, but replicate 
> many.
>
> Matt
>
>
>
> On Tue, Jul 28, 2020, 10:36 PM Linas Vepstas <[email protected]> wrote:
>
>>
>>
>> On Tue, Jul 28, 2020 at 11:41 PM Ben Goertzel <[email protected]> wrote:
>>
>>>
>>>
>>> Hmm... you are right that OpenCog hypergraphs have natural chunks
>>> defined by recursive incoming sets.   However, I think these chunks
>>> are going to be too small, in most real-life Atomspaces, to serve the
>>> purpose of chunking for a distributed Atomspace
>>>
>>> I.e. it is true that in most cases the recursive incoming set of an
>>> Atom should all be in the same chunk.  But I think we will probably
>>> need to deal with chunks that are larger than the recursive incoming
>>> set of a single Atom, in very many cases.
>>>
>>
>> I like the abstract to the Ja-be-ja paper, will read and ponder. It 
>> sounds exciting.
>>
>> But ... the properties of a chunk depend on what you want to do with it. 
>>
>> For example: if some peer wants to declare a list of everything it holds, 
>> then clearly, creating a list of all of its atoms is self-defeating. But if 
>> some user wants some specific chunk, well, how does the user ask for that? 
>> How does the user know what to ask for?   How does the user say "hey I want 
>> that chunk which has these contents"?  Should the user say "deliver to me 
>> all chunks that contain Atom X"? If the user says this, then how does the 
>> peer/server know if it has any chunks with Atom X in it?  Does the 
>> peer/server keep a giant index of all atoms it has, and what chunks they 
>> are in? Is every peer/server obliged to waste some CPU cycles to figure out 
>> if it's holding Atom X?  This gets yucky, fast.
>>
>> This is where QueryLinks are marvelous: the Query clearly states "this is 
>> what I want" and the query is just a single Atom, and it can be given an 
>> unambiguous, locally-computable (easily-computable; we already do this) 
>> 80-bit or a 128-bit (or bigger) hash, and that hash can be blasted out to 
>> the network (I'm thinking Kademlia, again) in a compact way - it's not a lot 
>> of bytes.  The request for the "query chunk" is completely unambiguous, and 
>> the user does not have to make any guesses whatsoever about what may be 
>> contained in that chunk.  Whatever is in there, is in there. This solves 
>> the naming problem above.
>>
>>
>>> What happens when the results for that (new) BindLink query are spread
>>> among multiple peers on the network in some complex way?
>>>
>>
>> I'm going to avoid this question for now, because "it depends" and "not 
>> sure" and "I have some ideas".
>>
>> My gut impulse is that the problem splits into two parts: first, find the 
>> peers that you want to work with, second, figure out how to work with those 
>> peers. 
>>
>> The first part needs to be fairly static, where a peer can advertise "hey 
>> this is the kind of data I hold, this is the kind of work I'm willing to 
>> perform." Once a group of peers is located, many of the scaling issues go 
>> away: groups of peers tend to be small.  If they are not, you organize them 
>> hierarchically, the way you might organize people, with specialists for 
>> certain tasks. 
>>
>> I think it's a mistake to try to think of a distributed atomspace as one 
>> super-giant, universe-filling uniform, undifferentiated blob of storage. I 
>> think we'll run into all sorts of conceptual difficulties and design 
>> problems if you try to do that. If nothing else, it starts smelling like 
>> quorum-sensing in bacteria. Which is not an efficient way to communicate. 
>> You don't want broadcast messages going out to the whole universe. Think 
>> instead of atomspaces connecting to one-another like dendrites and axons: a 
>> limited number, a small number of connections between atomspaces, but 
>> point-to-point, sharing only the data that is relevant for that particular 
>> peer-group.
>>
>> -- Linas
>>
>> -- 
>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>         --Peter da Silva
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "opencog" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>>
> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/opencog/CAHrUA35zN4aaSrZ2Dpu4qLUL1bYfjAF_rGiS_xxg2-E-SBqY3Q%40mail.gmail.com
>>  
>> <https://groups.google.com/d/msgid/opencog/CAHrUA35zN4aaSrZ2Dpu4qLUL1bYfjAF_rGiS_xxg2-E-SBqY3Q%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>>
>
