> So when someone says "database XYZ will solve all of the atomspace's
> problems", I mostly don't believe it.


If you think this is what I'm saying by describing Cassandra's
architectural choices, then you are missing the point.

It was not so when I first discovered OpenCog, but in the years since, I've
gained many man-months of hands-on experience with PostgreSQL, even more
than my experience with Cassandra. Postgres is a fantastic piece of
software, if what you need is an ACID-compliant relational database that
can fit on a single machine. Its various forms of distributed operation, of
which I have limited experience, seem to range from OK to terrible. And the
OK ones (proxied partitioning) leave the hardest problems unsolved (how to
partition intelligently) and create scaling bottlenecks (the
proxy/coordinator).

Cassandra, and many other modern non-relational data stores, are designed
from the ground up to operate in distributed clusters over unreliable
networks, and are willing to flex on things like ACID in order to gain
performance for specialized use cases that don't require strict ACID. They
are rarely interchangeable with each other, and they are almost never
interchangeable with a fully-consistent ACID-compliant relational database,
if that's what you really need.
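
A concrete example of that flexing is Cassandra's per-request tunable
consistency: each read and write chooses how many replicas must respond,
and simple quorum-overlap arithmetic decides whether reads are guaranteed
to see the latest write. A toy sketch of the rule (my own names and code,
not Cassandra's API):

```python
# Toy sketch of Cassandra-style tunable consistency (my own illustration,
# not Cassandra's implementation). With N replicas, a write acknowledged
# by W of them and a read that consults R of them are guaranteed to
# overlap on at least one up-to-date replica exactly when R + W > N.

def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """True if every R-node read quorum intersects every W-node write quorum."""
    return r + w > n

# N=3: QUORUM writes + QUORUM reads (2 + 2 > 3) -> strongly consistent.
assert read_sees_latest_write(3, 2, 2)
# N=3: ONE write + ONE read (1 + 1 <= 3) -> fast, but reads may be stale.
assert not read_sees_latest_write(3, 1, 1)
```

The point is that the trade-off is a dial, not a switch: workloads that
don't need strict consistency can buy latency and availability with it.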

None of the use cases so far described convince me that the atomspace needs
a fully-consistent ACID-compliant relational database. So I'm not
suggesting a different one; I'm suggesting not to use one.

I'm not saying, "use Cassandra instead of Postgres" (although I think that
could have benefits for some workloads).

I'm saying, "use a data distribution architecture like Cassandra's, instead
of Kademlia." (Although Kademlia may be a fine choice for the small part of
that architecture that does require a global DHT.)
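
To make "a data distribution architecture like Cassandra's" concrete: the
heart of it is a consistent-hash token ring that every peer can compute
locally, so there is no coordinator in the data path. A toy sketch (all
names are mine, and this is nothing like Cassandra's real implementation):

```python
# Toy sketch of consistent hashing on a token ring, Cassandra-style, as a
# contrast to a Kademlia-style DHT. Every node owns a token; a key is
# stored on the next `replication_factor` nodes clockwise from its hash.
import hashlib
from bisect import bisect_right

class TokenRing:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Sorted (token, node) pairs form the ring.
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, key: str):
        """The rf nodes responsible for key. Any peer can compute this
        locally -- no proxy or coordinator bottleneck."""
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, self._token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(min(self.rf, len(self.ring)))]

ring = TokenRing(["node-a", "node-b", "node-c", "node-d"])
assert len(ring.replicas("some-atom-key")) == 3
# Deterministic: every peer computes the same placement.
assert ring.replicas("x") == ring.replicas("x")
```

Adding or removing a node shifts only the keys adjacent to its token,
which is what makes the scheme workable on unreliable networks.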

Or, "use a tiered pub-sub messaging paradigm, with a distribution
architecture like Kafka or Pulsar, instead of trying to maintain global
state at all." (I'm less sure about this, but it makes more sense to me as
the independence of peer-subgroups increases.)
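
The pub-sub idea in miniature: subscribers attach to topics and each
consumes its own stream, so there is no global state to reconcile. A toy
in-process sketch (Kafka and Pulsar add persistence, partitioning, and
tiering on top of this basic shape; all names here are mine):

```python
# Minimal in-process pub-sub broker, as an illustration of the paradigm
# only -- real brokers like Kafka/Pulsar persist and partition the log.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Each subscriber gets its own copy; peer-subgroups stay independent.
        for handler in self.subscribers[topic]:
            handler(message)

broker = Broker()
seen = []
broker.subscribe("atoms.updates", seen.append)
broker.publish("atoms.updates", {"atom": "the", "count": 42})
assert seen == [{"atom": "the", "count": 42}]
```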

Matt

On Wed, Jul 29, 2020, 4:05 PM Linas Vepstas <[email protected]> wrote:

>
>
> On Wed, Jul 29, 2020 at 4:16 PM Matt Chapman <[email protected]>
> wrote:
>
>>
>> Is there a public document somewhere describing actual, present use-cases
>> for distributed atomspace? Ideally with some useful guesses at performance
>> requirements, in terms of updates per second to be processed on a single
>> node and across the cluster, and reasonably estimated hardware specs (num
>> cores, ram, disk) per peer?
>>
>
> No, because more or less no one does this. As far as I know, there are two
> cases: some experiments by the Ethiopians working on MOZI, and my work.
> Briefly, then ...
>
> In my case, it was language-learning ... one process chews through large
> quantities of text, incrementing counts on atoms, for days/weeks. Things
> like power-outages, or just plain screwups, can result in the loss of
> days/weeks of processing, so backing to disk is vital. Once you have a
> dataset, you may want to run different experiments on it, so you want to
> be able to make copies.
>
> Of course, I was using postgres for this. The experience was
> underwhelming. It was fairly easy to go into a mode where postgres is using
> 80% of the CPU, the atomspace is using 20%. Worse, the speed bottleneck was
> limited by rotating disk drives, so upgrading to SSD disks gave a huge
> performance boost (I don't recall the number). Even then, I found that
> traffic in the SATA link between disk and motherboard was 50% with bursts
> of 100% of grand-total SATA bandwidth -- so apparently postgres can
> saturate that just fine.  This is why PCIe SSD is interesting - PCIe has
> much higher bandwidth than SATA.
>
> Then there's postgres tuning. There are many things you can tweak in
> postgres, and one of them is to turn off the fsync flag (or something like
> that, I forget now) so things get buffered instead of being written to disk
> right away. This gives a huge performance boost -  sounds great! ... until
> the first thunderstorm knocked out power. The resulting database was
> corrupted; even the postgres tools could not recover it. So -- several
> weeks of work down the tubes. Of course, there are 100 warnings on 100
> blogs that say you should never do this unless your disk-drives have
> batteries built into them. Which is how I know that you can get disk drives
> with batteries in them :-)
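>
> For reference, the settings in question look roughly like this in
> postgresql.conf (from memory, so treat this as a sketch, and heed those
> 100 warnings):
>
> ```
> # postgresql.conf -- trading crash-safety for speed
> fsync = off               # don't force WAL writes to physical storage
> synchronous_commit = off  # acknowledge commits before the WAL is flushed
> ```
>
> The first of those is the one that can corrupt the whole database on a
> power loss; the second merely loses the most recent commits.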
>
> Now, you could respond by saying "oh postgres is bad, blah blah blah, you
> should use database X, blah blah blah it's so much better" but I find those
> arguments utterly unconvincing. One reason is I don't think the postgres
> people are stupid. The other is that the atomspace has a very unusual data
> structure. So -- it's very often the case that "normal" databases store
> objects that are thousands of bytes long, or larger -- say, javascript
> snippets, or gifs, or blog posts,  user-comments, web-pages, product
> ratings. Even when the data is small -- the data structure is nearly
> trivial "joe blow gave product X a five-star rating" -  a triple (name,
> product, rating).  The atomspace is almost the anti-universe of this: the
> atoms are almost always small - a dozen bytes, but have very complex
> inter-relationships, of atoms connected to atoms in every-which way in
> giant tangles. (Imagine the word "the" and a link to every sentence that
> word appears in. Then imagine links between other words in those sentences
> ...  most words are 3-7 bytes long. The tangle of what words are connected
> to what others is almost hopeless. This is also true for the genomics
> data.  I recently counted the number of gene-pairs, gene-triangles and gene
> tetragons in Habush's genome dataset, the number of tetragons is huge...
> but I digress.)  So when someone says "database XYZ will solve all of the
> atomspace's problems", I mostly don't believe it.
>
> ----
> The MOZI experience...
>
> MOZI had a very different experience (but they said very little about it,
> so I'm not sure).  From what I can tell: (1) they struggled to configure
> postgres, (2) the resulting performance was poor, (3) they needed a
> read-only underlay and read-write overlays, (4) they probably needed user
> logins, and (5) a data dashboard.
>
> Point 3 is that you have a genomic dataset -- large atomspace -- and a
> dozen scientists plinking on it. So, you can either make a dozen copies --
> or more -- each scientist might run several experiments a day - so 12
> scientists x 2 copies/scientist/day x 200 days/year x 50GB per copy = a big
> number.  Another approach is to have a single read-only atomspace, with
> read-write overlays, so everyone shares the read-only base, and makes
> changes that are isolated from the other scientists. The atomspace can now
> do this. The downside is that if you lose power ... and you also need a
> big-RAM machine, and each scientist might want to consume a lot of CPU, and
> there's still some lock contention in the atomspace, so if everyone is
> trying to modify everything all at the same time, you end up spending a lot
> of time waiting on locks.  I've tried to performance-tune the living
> daylights out of the atomspace, but workload-specific issues always remain
> ... some workloads are very different from others.
>
> A third approach is one that Habush is working on now: a central server
> for the read-only copy, and local users copying over those subsets that
> they need for their particular experiments.  Details are TBD, but we've got
> several dueling approaches: I keep blathering about cogserver and backends
> because it works, and is proven, he's trying a different set of
> architectures, which is a good thing, because it's unhealthy to get trapped
> in Linas' tiny thought-bubble. You want to go out and explore the world.
>
> Point (4) user logins ... no clue.  There's nothing in the atomspace
> regarding data protection, data security, partitioned access, safety,
> verifiability, loss recovery, homomorphic computing :-)  The security model
> for the atomspace is out of the 1960's - if you have physical access to the
> atomspace, you can do anything at all.
>
> Point (5) data dashboard -- a nice GUI that lists all of the atomspaces
> you've got, which ones are live, which ones are archived, what's in them,
> how big are they? What jobs are running now? Are they almost done? Do I
> have enough RAM/CPU to launch one more? (never mind, do I have enough
> disk/SATA bandwidth?)  Imagine one of those youtube ads with smiling
> millennials pretending to run a business by swiping right on their cell
> phone.
>
> Point (5) is a HUGE GIANT BIG DEAL. I've spent years pumping data through
> the atomspace and it's like sitting in front of a laundry machine watching
> your clothes spin round. It's boring as heck, but if you don't watch,
> something goes wrong or you forget what you're doing or you can't find the
> results from last week because you forgot where you put them. This is a
> major issue. I slog through this with a 100% manual process because I'm a
> masochist, but it is not a tenable state of affairs.
>
> And more or less everyone is ignoring this, although MOZI does have
> assorted GUI's and dashboards for genomics data.
>
> -- linas
>
> --
> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>         --Peter da Silva
>
> --
> You received this message because you are subscribed to the Google Groups
> "opencog" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/opencog/CAHrUA373AXOXpv_YNASvZa6W1qwCpKPFm80Zc-Dr0gkC1pCE7w%40mail.gmail.com
> <https://groups.google.com/d/msgid/opencog/CAHrUA373AXOXpv_YNASvZa6W1qwCpKPFm80Zc-Dr0gkC1pCE7w%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAPE4pjBr0OUZMUiRs6JxWtucLve4MZ1KKV0DdR7F%2BkaieANebQ%40mail.gmail.com.
