Hi Matt,

ZeroMQ is nice, if you use it correctly. The meta-message is that some
subsystems in the AtomSpace have reached the point where one can count
bytes and cycles, and it's not obvious how to shave anything more off. The
other message is that it's really easy to pointlessly waste cycles
converting atoms into packets of various kinds to ship them around.

The biggest message (I'm starting to think) is that placing atoms into a
database, any database at all, is pointless and useless. You can't actually
"do anything" with atoms that are in a database; they are useful only when
you can crawl over them (e.g. with the pattern engine, or with PLN, or
AS-MOSES, or the pattern miner, etc.).

And this is key: I can't dump atoms into Cloudera (or Databricks) and run
data analytics on them. That doesn't make sense.

So the only usefulness left in having a database is treating it like an
archive manager. So, e.g., dump a million atoms into a text file or a binary
blob, stick a name, date and author on it, and archive it. So I guess what
I really want is an archive manager. Like the lowly file manager on your
pee cee. Or your photo-album manager. Except I don't need any database at
all to do that -- I can just stick them on BitTorrent to make them
"distributed" -- or IPFS, or dat:// if you want to be awesomely modern.
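For concreteness, here's a rough sketch of what such an archive could look like -- a gzipped file with a one-line metadata header followed by the atoms as s-expressions. The layout and the `write_archive`/`read_manifest` helpers are invented for illustration; this is not an existing AtomSpace format.

```python
import datetime
import gzip
import json

def write_archive(path, atoms, name, author):
    """Dump atoms (as s-expression strings) into a gzipped file with a
    one-line JSON header carrying the name, date, author, and atom count.
    Hypothetical layout, just to make the 'archive' idea concrete."""
    manifest = {
        "name": name,
        "author": author,
        "date": datetime.date.today().isoformat(),
        "atom-count": len(atoms),
    }
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(json.dumps(manifest) + "\n")
        for atom in atoms:
            f.write(atom + "\n")

def read_manifest(path):
    """Read back only the metadata header -- no need to parse the
    million atoms that follow it."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.loads(f.readline())
```

The point being: name, date, author travel with the blob, and the blob itself can sit on BitTorrent or IPFS unchanged.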

So it seems like what I need is some way of saying "find me the archive of
genomic data that has covid-19 in it" and maybe "if it's too big find me a
smaller chunk with only the covid and not the reactome data".   And slap a
real slick GUI on this ... seriously, like some cell-phone app GUI.
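That kind of lookup needs almost nothing: if every archive carries a manifest with tags and an atom count, the whole "query language" is a filter. All the names and fields below are invented for illustration.

```python
def find_archives(manifests, keyword, max_atoms=None):
    """Return the names of archives whose tags mention the keyword,
    optionally skipping anything bigger than max_atoms. No database
    required -- just a list of manifests."""
    hits = []
    for m in manifests:
        if any(keyword.lower() in tag.lower() for tag in m["tags"]):
            if max_atoms is None or m["atom-count"] <= max_atoms:
                hits.append(m["name"])
    return hits

# Toy manifests standing in for the genomic example above.
manifests = [
    {"name": "genomic-full", "tags": ["covid-19", "reactome"],
     "atom-count": 50_000_000},
    {"name": "covid-only", "tags": ["covid-19"],
     "atom-count": 2_000_000},
]
```

So `find_archives(manifests, "covid-19", max_atoms=10_000_000)` picks the smaller chunk without the reactome data.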

There's still a hole in the above: I can hear someone saying "but what if I
have a trillion atoms, and I need to search all of them? That would require
a petabyte of RAM!" and... well, I have some ideas for that (and so do
others on this mailing list), but none of those ideas require a database of
any sort! The database "gets in the way", as it were. It provides very
nearly zero added value (except in a few corner cases...). In particular,
all existing data-analytics systems seem to be 100% useless for doing
anything with atoms... !?! 😮

It feels like .. all these years, we've been imagining the wrong
solution...

We need a plain-old archive manager, for now, and a chunk manager once we
figure out what chunks are...

And by archive manager I'm thinking, like -- click a button and that
launches an AtomSpace, loads it with archive 42, and then starts PLN or
whatever crawling on it... so maybe like the MOZI GUIs, which I have never
used....
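The button-click could be as dumb as assembling a guile command line: start a fresh AtomSpace, load the archive, hand control to an agent. Note that `(start-agent ...)` is a placeholder I made up -- I'm not claiming any such scheme function exists -- this just shows the shape of the thing:

```python
def launch_command(archive_path, agent="pln"):
    """Build the argv that a button-click would exec: guile loading the
    named archive into a fresh AtomSpace, then starting the named agent.
    (start-agent ...) is a hypothetical placeholder, not a real opencog
    call."""
    scheme = (
        "(use-modules (opencog)) "
        f'(load "{archive_path}") '
        f"(start-agent '{agent})"
    )
    return ["guile", "-c", scheme]
```

A GUI would just `subprocess.Popen(launch_command("archive-42.scm"))` behind the button.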

--linas


On Wed, Jul 29, 2020 at 8:25 PM Matt Chapman <[email protected]> wrote:

> I didn't take it personally, no need to apologize. I enjoy the more relaxed
> style, and often try for the same. Anyway, you're the expert here, and I
> should be disregarded if I'm speaking nonsense, but I'll make one more
> point in an attempt to convince you that I'm not:
>
> If the unit of distribution is a chunk, i.e., an Atom and the Atoms that
> make up its outgoing set, then your average storage size is (12 bytes *
> average num outgoing^depth), and that rapidly gets to the point where the
> serialization overhead becomes a small fraction of the whole chunk-record.
>
> I remember when protobuf over ZeroMQ was the toy of the month, and now
> that I understand those technologies better, I'm sure protobuf is a
> terrible idea, but I don't necessarily see anything wrong with using ZeroMQ
> to pass around 12-byte messages or 12n^d-byte messages. In fact, the ML
> system that I mentioned earlier, which used Scylla/Cassandra as its
> distributed feature store, used ZeroMQ for communication between its dozens
> of workers.
>
> I also know that the creator of ZeroMQ has moved on to a new messaging
> project intended to replace zmq, but I don't recall the name of the new one
> yet.
>
> Anyway, I guess my point here is just to encourage you not to discard
> technologies like ZeroMQ or NoSQL databases too quickly, just because
> someone failed to create a successful implementation in the past.
>
> Anyway, despite my offhand response to Ben, I'm not convinced replacing
> the current Postgres backend with an interface to Cassandra would solve all
> the problems. But I am convinced that a token-ring-like architecture with
> tunable consistency and topology-aware peer discovery and bloom-filter
> caching is essential for the kind of distributed data-sharing performance
> you want, assuming a data set larger than can possibly fit in memory on a
> single machine. I doubt it can be done with a DHT alone.
>
> Distributed data processing of any kind is hard to get right. There are
> reasons we get a new unicorn start-up in this space every 6 months, and why
> most of them are on life support about a year later. (Cloudera. I predict
> Databricks' reign will end too. Confluent may be on the rise now, but we'll
> see.)
>
> Also, keep in mind that you'll never get anything like the performance of
> loading from disk once you're dealing with other machines over a network.
> You're trading that performance for the ability to work with data larger
> than you can fit on your own machine. So comparisons to loading from a
> static file are unfair, although at least we're finally talking about
> concrete numbers that can set the goal to reach for.
>
> Will anybody out there fund my Distributed Data start-up if I can build a
> PoC that loads an atomspace from Kafka as fast as you can load from your
> ASCII files? ;-)
>
> Best,
>
> Matt
>
>
>
>
>
> On Wed, Jul 29, 2020, 5:50 PM Linas Vepstas <[email protected]>
> wrote:
>
>>
>>
>> On Wed, Jul 29, 2020 at 6:45 PM Matt Chapman <[email protected]>
>> wrote:
>>
>>>
>>> If you think this is what I'm saying by describing Cassandra's
>>>
>>
>> Sorry, it was not meant to be a jab at you ... over the last decade,
>> something like a dozen different databases have been proposed, each with
>> different reasons for using them. As I recall -- "NoSQL databases" -- BASE
>> not ACID -- so we tried memdb (couchdb(?) was recommended). The bitter
>> lesson was that it was optimized for 100MByte mp3's and 1MByte gifs and had
>> a throughput of about 100 atoms/second. The memdb developers couldn't care
>> less -- "what kind of moron stores 12 bytes in a database?" was the general
>> reaction.
>>
>> Then there was the neo4j work. The lesson there was that 95% of CPU was
>> spent converting atoms into ZeroMQ packets (using Google protocol buffers,
>> if I recall) and RESTful APIs written in Python using decorators ...
>> lord knows how much CPU neo4j itself spent unpacking the packets. Again,
>> I think this was also about 100 atoms/second ... This is when the idea of
>> chunks and chunking started getting discussed, since obviously things could
>> run faster if we could ship thousands of atoms over at a time. Or maybe if
>> we could get neo4j to do the pattern matching, and ship back only the
>> results. How do you send a pattern-matcher query to neo4j?
>>
>> By comparison, the current ASCII-file-reader for reading Atoms in
>> s-expression format does about 100K atoms/second  (that's on my machine ...
>> I'm told that the latest Apple laptops are maybe 5x faster?...) I actually
>> measured: about 45% of CPU time was spent doing string-compares and
>> string-copying and find-first-character-in-string, and 55% of the CPU time
>> was in the AtomSpace, actually adding Atoms. Or maybe it was 55/45 the
>> other way around. I forget.
>>
>> I do have extensive notes on AtomSpace performance in
>> https://github.com/opencog/benchmark/ -- on my machine, the raw AtomSpace
>> does 700K nodes/sec and 200K links/sec, so maybe a million/sec on something
>> modern. Running at 100 atoms/sec through some RESTful/ZeroMQ/whatever
>> interface is embarrassing.
>>
>> I'm writing in this flippant style because I'm trying to make it fun to
>> read my emails. There's a serious lesson here: converting things that are
>> 12 bytes long into other things has just a huge overhead. I'm not sure how
>> C++ std::string is implemented -- how many CPU cycles it takes to compare a
>> byte, add one, and go to the next byte ... but if you do anything much more
>> complicated than that, you pay a performance penalty. This is where the
>> performance bar is set. It's hard to figure out how to jump over that bar.
>> Or even get near it.
>>
>> -- Linas
>>
>> --
>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>         --Peter da Silva
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "opencog" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/opencog/CAHrUA35WRUonm82pMLDXqgqS7oV339o7KjTDQg4o_gWQJnE7Bw%40mail.gmail.com
>> <https://groups.google.com/d/msgid/opencog/CAHrUA35WRUonm82pMLDXqgqS7oV339o7KjTDQg4o_gWQJnE7Bw%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>>


-- 
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

