On Wed, Jul 29, 2020 at 4:16 PM Matt Chapman <[email protected]> wrote:
>
> Is there a public document somewhere describing actual, present use-cases
> for distributed atomspace? Ideally with some useful guesses at performance
> requirements, in terms of updates per second to be processed on a single
> node and across the cluster, and reasonably estimated hardware specs (num
> cores, ram, disk) per peer?
>
No, because more or less no one does this. As far as I know, there are two
cases: some experiments by the Ethiopians working on MOZI, and my work.
Briefly, then ...
In my case, it was language-learning: one process chews through large
quantities of text, incrementing counts on atoms, for days or weeks at a
time. Power outages, or just plain screw-ups, can wipe out days or weeks
of processing, so backing the atomspace to disk is vital. And once you
have a dataset, you may want to run different experiments on it, so you
also want to be able to make copies.
Of course, I was using postgres for this. The experience was underwhelming.
It was fairly easy to get into a mode where postgres was using 80% of the
CPU and the atomspace only 20%. Worse, the bottleneck was rotating disk
drives, so upgrading to SSDs gave a huge performance boost (I don't recall
the exact number). Even then, I found that traffic on the SATA link between
disk and motherboard ran at 50% of the grand-total SATA bandwidth, with
bursts to 100% -- so apparently postgres can saturate that just fine. This
is why PCIe SSDs are interesting: PCIe has much higher bandwidth than SATA.
Then there's postgres tuning. There are many things you can tweak, and one
of them is to turn off the fsync flag (or something like that, I forget
now) so that writes get buffered instead of being flushed to disk right
away. This gives a huge performance boost -- sounds great! ... until the
first thunderstorm knocks out power. The resulting database was corrupted;
even the postgres recovery tools could not salvage it. So -- several weeks
of work down the tubes. Of course, there are 100 warnings on 100 blogs
saying you should never do this unless your disk drives have batteries
built into them. Which is how I know that you can get disk drives with
batteries in them :-)
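For the record, the standard postgresql.conf durability knobs look roughly
like this (I can't swear which exact flag was flipped in the story above,
so take this as a sketch of the trade-off, not a record of what I did):

```ini
# postgresql.conf -- durability vs. speed trade-offs
fsync = on                 # off = buffer writes, big speedup, but a power
                           #       loss can corrupt the database beyond repair
synchronous_commit = on    # off = the safer middle ground: a crash loses the
                           #       last few transactions, but does NOT corrupt
full_page_writes = on      # only turn off if storage guarantees atomic writes
```

Turning off synchronous_commit gets you most of the speedup with bounded
damage; turning off fsync is the one that eats whole databases.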
Now, you could respond by saying "oh postgres is bad, blah blah blah, you
should use database X, blah blah blah, it's so much better", but I find
those arguments utterly unconvincing. One reason is that I don't think the
postgres people are stupid. The other is that the atomspace has a very
unusual data structure. "Normal" databases very often store objects that
are thousands of bytes long, or larger -- javascript snippets, gifs, blog
posts, user comments, web pages, product ratings. Even when the data is
small, the structure is nearly trivial: "joe blow gave product X a
five-star rating" is just a triple (name, product, rating). The atomspace
is almost the anti-universe of this: the atoms are almost always small -- a
dozen bytes -- but have very complex inter-relationships, atoms connected
to atoms every which way in giant tangles. (Imagine the word "the" and a
link to every sentence that word appears in, then links between the other
words in those sentences ... most words are 3-7 bytes long, and the tangle
of which words connect to which others is almost hopeless. The same is true
of genomics data: I recently counted the gene-pairs, gene-triangles and
gene-tetragons in Habush's genome dataset, and the number of tetragons is
huge ... but I digress.) So when someone says "database XYZ will solve all
of the atomspace's problems", I mostly don't believe it.
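That pair/triangle/tetragon counting is, abstractly, just subgraph counting
on an undirected graph. A toy sketch (the edge list here is made up, NOT
the actual genome data, and brute force like this only works on small
graphs):

```python
# Count pairs (edges), triangles, and 4-cycles ("tetragons") in an
# undirected graph, brute force.  Edge list is a made-up example: a
# complete graph on four "genes".
from itertools import combinations

edges = {("A", "B"), ("B", "C"), ("A", "C"),
         ("C", "D"), ("D", "A"), ("B", "D")}
norm = {tuple(sorted(e)) for e in edges}       # treat links as undirected
nodes = {n for e in norm for n in e}

def linked(x, y):
    return tuple(sorted((x, y))) in norm

pairs = len(norm)
triangles = sum(1 for a, b, c in combinations(sorted(nodes), 3)
                if linked(a, b) and linked(b, c) and linked(a, c))

tetragons = 0
for a, b, c, d in combinations(sorted(nodes), 4):
    # with 'a' held fixed, these are the three distinct 4-cycles on {a,b,c,d}
    for x, y, z in ((b, c, d), (b, d, c), (c, b, d)):
        if linked(a, x) and linked(x, y) and linked(y, z) and linked(z, a):
            tetragons += 1

print(pairs, triangles, tetragons)   # 6 4 3 for the complete graph on 4 nodes
```

The point of the digression stands: the interesting structure is in the
links, not in the (tiny) atoms themselves, and tetragon counts blow up fast.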
----
The MOZI experience...
MOZI had a very different experience (but they said very little about it,
so I'm not sure). From what I can tell: (1) they struggled to configure
postgres; (2) the resulting performance was poor; (3) they needed a
read-only underlay and read-write overlays; (4) they probably needed user
logins; (5) they wanted a data dashboard.
Point (3): you have a genomic dataset -- a large atomspace -- and a dozen
scientists plinking on it. You can either make a dozen copies, or more --
each scientist might run several experiments a day, so 12 scientists x 2
copies/scientist/day x 200 days/year x 50GB per copy = 240 TB/year, a big
number. The other approach is a single read-only atomspace with read-write
overlays, so everyone shares the read-only base and makes changes that are
isolated from the other scientists. The atomspace can now do this. The
downsides: if you lose power, unsaved overlay work is gone; you need a
big-RAM machine; each scientist might want to consume a lot of CPU; and
there's still some lock contention in the atomspace, so if everyone tries
to modify everything at the same time, a lot of time gets spent waiting on
locks. I've tried to performance-tune the living daylights out of the
atomspace, but workload-specific issues always remain ... some workloads
are very different from others.
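The copy-everything arithmetic above is worth writing out (all the numbers
are the hypothetical round figures from the example):

```python
# Back-of-the-envelope storage cost of giving every scientist private
# copies of the dataset, instead of a shared read-only base.
scientists = 12
copies_per_scientist_per_day = 2
working_days_per_year = 200
gb_per_copy = 50

total_gb = (scientists * copies_per_scientist_per_day
            * working_days_per_year * gb_per_copy)
print(total_gb, "GB/year")           # 240000 GB, i.e. roughly 240 TB a year
```

Which is why sharing one read-only base, or copying only needed subsets,
starts to look attractive.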
A third approach is the one Habush is working on now: a central server
holds the read-only copy, and local users copy over the subsets they need
for their particular experiments. Details are TBD, but we've got several
dueling approaches: I keep blathering about the cogserver and backends
because that works and is proven; he's trying a different set of
architectures, which is a good thing, because it's unhealthy to get trapped
in Linas' tiny thought-bubble. You want to go out and explore the world.
Point (4), user logins ... no clue. There's nothing in the atomspace
regarding data protection, data security, partitioned access, safety,
verifiability, loss recovery, homomorphic computing :-) The security model
for the atomspace is out of the 1960s: if you have physical access to the
atomspace, you can do anything at all.
Point (5), the data dashboard: a nice GUI that lists all of the atomspaces
you've got -- which ones are live, which are archived, what's in them, how
big they are. What jobs are running now? Are they almost done? Do I have
enough RAM/CPU to launch one more? (Never mind, do I have enough disk/SATA
bandwidth?) Imagine one of those youtube ads with smiling millennials
pretending to run a business by swiping right on their cell phone.
Point (5) is a HUGE GIANT BIG DEAL. I've spent years pumping data through
the atomspace, and it's like sitting in front of a laundry machine watching
your clothes spin round. It's boring as heck, but if you don't watch,
something goes wrong, or you forget what you're doing, or you can't find
the results from last week because you forgot where you put them. This is a
major issue. I slog through it with a 100% manual process because I'm a
masochist, but it is not a tenable state of affairs.
And more or less everyone is ignoring this, although MOZI does have
assorted GUIs and dashboards for genomics data.
-- linas
--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
--Peter da Silva