OK, so I think I now understand that a Chunk is defined as, minimally, an atomX and all the atoms in its incoming set. In that case, the "name" of the chunk may as well be the hash of the central atomX. If a Chunk2 is defined as "an atom, its incoming-set atoms, and their incoming-set atoms," then the name of that chunk can be a hash of the central atom's hash concatenated with the hashes of all the first-generation atoms, and so on, turtles and hashes all the way down to ChunkN.
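To make that concrete, here is a minimal Python sketch of the hash-of-hashes naming scheme, assuming a 64-bit atom hash (simulated here by truncating SHA-256) and a hypothetical incoming-set map -- none of these names are an actual atomspace API. Sorting the child names before concatenation is my addition, so the name doesn't depend on enumeration order:

```python
# Hypothetical sketch of ChunkN naming: hash of the central atom's hash
# concatenated with the ChunkN-1 names of its incoming-set atoms.
import hashlib

def atom_hash(atom: str) -> bytes:
    """Stand-in for the atom's own 64-bit hash."""
    return hashlib.sha256(atom.encode()).digest()[:8]

def chunk_name(atom: str, incoming: dict, depth: int) -> bytes:
    """Name of ChunkN centered on `atom`: at depth 0, just the atom's
    hash; otherwise the hash of that hash concatenated with the sorted
    ChunkN-1 names of its incoming-set atoms."""
    h = atom_hash(atom)
    if depth == 0:
        return h
    child_names = sorted(chunk_name(a, incoming, depth - 1)
                         for a in incoming.get(atom, []))
    return hashlib.sha256(h + b"".join(child_names)).digest()[:8]

# Example: atomX with two incoming atoms, one of which has its own parent.
incoming = {"atomX": ["linkA", "linkB"], "linkA": ["linkC"]}
name1 = chunk_name("atomX", incoming, 1)   # Chunk1: atomX + incoming set
name2 = chunk_name("atomX", incoming, 2)   # Chunk2: two generations
```

Note that adding any atom to the first generation changes name1 -- which, as discussed below, could be a feature or a bug depending on what you want.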
Except we have the problem that some process may add a new atom to, e.g., the first generation, so under the above strategy we get a new hash-of-hashes for the chunks centered on atomX. This could be a feature, not a bug; it depends on what you want. If you want a name that dynamically refers to an atom and all its incoming children to the Nth generation, then I guess that name is "atomX-N", no? But if the process that is adding atoms lives on some other node in our atomspace cluster, how can we know about that new atom? I'll flail stupidly toward some hopefully relevant ideas in the rest of this message.

> The goal is that user A on the other side of the planet can agree with
> user B on the name of the chunk, without having to talk between each
> other first.

What you really want is, of course, impossible in a distributed scenario, but we can get a close approximation. If it's not obvious why I think it's impossible, I can explain in a follow-up.

In most commercial distributed data storage systems I know of (several), you have both a Partition Strategy (how do I decide which cluster node owns this record/atom?) and a Replication Strategy (where do I put copies of this record so that I can recover it if the owner disappears, or so I can accept writes from multiple nodes?). Some systems also have a Consistency Strategy (how many replicas do I have to look at in order to be sure of the present state of the record?).

In this nomenclature, I would suggest that Chunks are not a Partitioning problem, but rather a Replication & Consistency problem. I suggest this because I believe some things that may not be true, so let me write out my assumptions, so that you can ignore the rest of this message as soon as you hit an assumption that doesn't hold:

1. Some cluster node will "own" each atom by assignment via some simple division of the hash address space.

2.
Each cluster node will also contain replicas of many other atoms, not only for disaster-recovery purposes, but also because mind agents on that node will need many atoms "owned" by other nodes in local memory. Once we've obtained them from their owners, we might as well keep them around until we need to recover memory space for other, more urgently needed "borrowed" atoms.

3. A mind agent on a given node wants to be able to update atom properties (truth value, etc.) locally, without having to talk to the "owner" node directly.

4. Perfect consistency of atom state between different nodes is not a strict requirement, but it is desirable for a node to be able to identify the "authoritative" source for a given atom, and that source should reflect a reasonably recent state of the atom as updated by any replica node.

5. Relatively poor storage efficiency is acceptable. I.e., a single node may only be able to dedicate a relatively small portion of its memory to storing the atoms it owns; a majority of its space may go to replicated atoms. Nodes are cheap; we'll just buy more. :-)

Given those design goals, I think we're looking at a publish-subscribe model for replicating updates to atoms. So the owner of a given atom would also subscribe to updates for all atoms in the chunk (i.e., all atoms in the owned atom's incoming set), thus committing, on a best-effort basis, to maintain a reasonably up-to-date subgraph of all the atoms in the chunk, so that when some node in the cluster requests the chunk (by reference to the central atom's hash), it can also get reasonably recent copies of all the connected atoms. If a particular mind agent is very sensitive to consistency, it can, of course, take the time to request the authoritative state of each atom in the chunk from its owner, but (I assume) in most cases this won't be necessary.
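As a toy illustration of assumptions 1-3 and the publish-subscribe idea, here is a sketch in Python. Everything here is hypothetical (node counts, class names, method names -- none of it is an actual atomspace API): ownership by simple division of a 64-bit hash space, and nodes that push best-effort updates to whoever has subscribed to an atom:

```python
# Hypothetical sketch: hash-range ownership plus best-effort
# publish-subscribe replication of atom updates.
from collections import defaultdict

NUM_NODES = 16
HASH_SPACE = 2 ** 64

def owner_of(atom_hash: int) -> int:
    """Assumption 1: the owner node is found by simple division of
    the 64-bit hash address space into equal ranges."""
    return atom_hash // (HASH_SPACE // NUM_NODES)

class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.replicas = {}                    # atom_hash -> latest known state
        self.subscribers = defaultdict(set)   # atom_hash -> nodes to notify

    def subscribe(self, atom_hash: int, node: "Node"):
        """E.g. a chunk owner subscribes to every atom in the owned
        atom's incoming set, so it can serve reasonably fresh chunks."""
        self.subscribers[atom_hash].add(node)

    def publish_update(self, atom_hash: int, state):
        """Assumption 3: update locally first, then push to subscribers
        on a best-effort basis (eventually consistent copies)."""
        self.replicas[atom_hash] = state
        for node in self.subscribers[atom_hash]:
            node.replicas[atom_hash] = state
```

A consistency-sensitive mind agent would instead query `owner_of(h)` directly for the authoritative state, at the cost of a round trip per atom.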
The mind agent can also choose to subscribe directly to the update stream from the authoritative node, if it desires to apply updates caused by other mind agents, or it can periodically request the chunk again from the central atom's owner, if it prefers to trade consistency/latency for bandwidth efficiency.

See also:
https://en.wikipedia.org/wiki/PACELC_theorem
https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archDataDistributeAbout.html
https://en.wikipedia.org/wiki/Consistency_model#Consistency_and_replication

All the Best,
Matt

--
Please interpret brevity as me valuing your time, and not as any negative intention.

On Thu, Jul 23, 2020 at 3:54 PM Linas Vepstas <[email protected]> wrote:
>
> On Thu, Jul 23, 2020 at 11:34 AM Ben Goertzel <[email protected]> wrote:
>
>> > What I am fishing for, is either some example pseudocode, or the name
>> of some algorithm or some wikipedia article that describes that algorithm,
>> which can compute chunks. Ideally, an algo that can run in fewer than a
>> few thousand CPU cycles, because, after that, performance becomes a
>> problem. If none of this, then some brainstorming for how to find a
>> reasonable algorithm.
>>
>> Linas, just to be sure we're in sync -- how large of a chunk are you
>> thinking this algorithm would typically find?
>
> Arbitrary. If you look at what happened with opencog-ipfs or opencog-dht,
> there are several key operations. One is, of course, "who's got this
> atom?" but that's easy: each atom has a 64-bit hash (or 80-bit on opendht
> by default, but that's settable). Next, "what's the incoming set of this
> atom?" Whoops, can't compute the hash of that, because we don't know what
> it is. So you can ask, and get back a list of N other atoms (or hashes)
> that are in the incoming set. Where are they?
> Well, each different atom gets a totally different hash, so they spread
> all over the planet (because that's how Kademlia works), when in fact,
> what we really wanted to say was "gee golly, the incoming set of an atom
> is 'close to' the atom itself, get me the ball of close-by stuffs". But I
> can't figure out how to "say that".
>
> Anyway, that is what I am trying to define as a chunk: an atom and
> everything "nearby", with a variable conception of "nearby".
>
> atomspace-ipfs had multiple major stumbling blocks. One is that the IPFS
> documents are immutable, so for each new atomspace, you have to publish a
> brand-new document -- which has a completely different hash, so whoops,
> how do you find out the hash of that? Well, IPFS has a DNS-like naming
> system, but it was horridly slow, totally unusable (multi-second lookups
> with 60-second timeouts). The second problem is that it's "centralized" --
> you have to jam the *entire* atomspace into the document. So it's klunky;
> it won't scale for large atomspaces. Some notion of chunks alleviates
> that. But maybe something less klunky than IPFS would be better.
>
> So that suggests a lower-level building block -- e.g. opendht -- and that
> is how atomspace-dht was born. But that now seems to be maybe "too low".
> It suffered from the chunking problem.
>
> Here's one, somewhere in the middle: "earthstar" --
> https://github.com/cinnamon-bun/earthstar is a decentralized document
> store. Cross out "document" and write "atomspace" instead. Or rather,
> cross out "atomspace" and write "chunk" instead. Or something like that.
> Quite unclear.
>
> The reason atomspace-cog got created is that it seems best to have
> "seeders", same idea as in bittorrent, so at least one server is the
> source of truth for a given atomspace, even if all the other servers are
> down/offline. The current ipfs and dht backends do not use seeders, but
> I've got extremely vague plans to change that.
>
> --linas
>
> --
> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> --Peter da Silva

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CAPE4pjDMT9Oz%3DbmiAtZXsse9C1rvPCb61vj2DHcyuX0%3DtxxpKA%40mail.gmail.com.
