I didn't take it personally, no need to apologize. I enjoy the more relaxed style, and often aim for the same. Anyway, you're the expert here, and I should be disregarded if I'm speaking nonsense, but I'll make one more point in an attempt to convince you that I'm not:
If the unit of distribution is a chunk, i.e., an Atom plus the Atoms that make up its outgoing set, then your average storage size is (12 bytes * average num outgoing^depth), and that rapidly gets to the point where the serialization overhead becomes a small fraction of the whole chunk-record. I remember when protobuf over ZeroMQ was the toy of the month, and now that I understand those technologies better, I'm sure protobuf is a terrible idea, but I don't necessarily see anything wrong with using ZeroMQ to pass around 12-byte messages, or 12n^d-byte messages. In fact, the ML system I mentioned earlier, which used Scylla/Cassandra as its distributed feature store, used ZeroMQ for communication between its dozens of workers. I also know that the creator of ZeroMQ has moved on to a new messaging project intended to replace zmq, but I don't recall its name offhand.

Anyway, I guess my point here is just to encourage you not to discard technologies like ZeroMQ or NoSQL databases too quickly, just because someone failed to create a successful implementation in the past. Despite my offhand response to Ben, I'm not convinced that replacing the current Postgres backend with an interface to Cassandra would solve all the problems. But I am convinced that a token-ring-like architecture with tunable consistency, topology-aware peer discovery, and bloom-filter caching is essential for the kind of distributed data-sharing performance you want, assuming a data set larger than can possibly fit in memory on a single machine. I doubt it can be done with a DHT alone. Distributed data processing of any kind is hard to get right. There are reasons we get a new unicorn start-up in this space every 6 months, and why most of them are on life support about a year later. (Cloudera, for example. I predict Databricks' reign will end too.
Confluence may be on the rise now, but we'll see.)

Also, keep in mind that you'll never get anything like the performance of loading from disk once you're dealing with other machines over a network. You're trading that performance for the ability to work with data larger than you can fit on your own machine. So comparisons to loading from a static file are unfair, although at least we're finally talking about concrete numbers that can set the goal to reach for. Will anybody out there fund my Distributed Data start-up if I can build a PoC that loads an atomspace from Kafka as fast as you can load from your ASCII files? ;-)

Best,
Matt

On Wed, Jul 29, 2020, 5:50 PM Linas Vepstas <[email protected]> wrote:
>
> On Wed, Jul 29, 2020 at 6:45 PM Matt Chapman <[email protected]> wrote:
>
>> If you think this is what I'm saying by describing Cassandra's
>
> Sorry, it was not meant to be a jab at you ... over the last decade, something like a dozen different databases have been proposed, each with different reasons for using them. As I recall -- "nosql databases" -- BASE not ACID -- so we tried memdb (couchdb(?) was recommended). The bitter lesson was that it was optimized for 100MByte mp3's and 1MByte gifs and had a throughput of about 100 atoms/second. The memdb developers couldn't care less - "what kind of moron stores 12 bytes in a database?" was the general reaction.
>
> Then there was the neo4j work. The lesson there was that 95% of CPU was spent converting atoms into ZeroMQ packets (using google protocol buffers, if I recall) and RESTful API's written in python using python decorators ... lord knows how much CPU in neo4j itself unpacking the packets. Again, I think this was also about 100 Atoms/second ... This is when the idea of chunks and chunking started getting discussed, since obviously things could run faster if we could ship thousands of atoms over at a time.
> Or maybe if we could get neo4j to do the pattern matching, and ship back only the results. How do you send a pattern-matcher query to neo4j?
>
> By comparison, the current ASCII-file-reader for reading Atoms in s-expression format does about 100K atoms/second (that's on my machine ... I'm told that the latest Apple laptops are maybe 5x faster? ...) I actually measured: about 45% of CPU time is spent doing string-compares and string-copying and find-first-character-in-string, and 55% of the cpu time was in the atomspace, actually adding Atoms. Or maybe it was 55/45 the other way around. I forget.
>
> I do have extensive notes on atomspace performance in https://github.com/opencog/benchmark/ - on my machine, raw atomspace is 700K nodes/sec and 200K links/sec, so maybe a million/sec on something modern. Running at 100 atoms/sec through some RESTful/zero-mq/whatever interface is embarrassing.
>
> I'm writing in this flippant style because I'm trying to make it fun to read my emails. There's a serious lesson here: converting things that are 12 bytes long into other things has just a huge overhead. I'm not sure how c++ std::string is implemented -- how many cpu cycles it takes to compare a byte, add one and go to the next byte ... but if you do anything much more complicated than that, you pay a performance penalty. This is where the performance bar is set. It's hard to figure out how to jump over that bar. Or even get near it.
>
> -- Linas
>
> --
> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> --Peter da Silva
>
> --
> You received this message because you are subscribed to the Google Groups "opencog" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA35WRUonm82pMLDXqgqS7oV339o7KjTDQg4o_gWQJnE7Bw%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAPE4pjCNXnMgba6e6xET4x6uMXpytNykB_h7wkmv27Zpbs%3DrSQ%40mail.gmail.com.
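P.S. For concreteness, here is the back-of-the-envelope chunk-size arithmetic from my first point as a few lines of Python. The 12-byte-per-Atom figure is the one from this thread; the branching factors and depths are made up purely for illustration:

```python
# Rough chunk sizes: 12 bytes per Atom times (avg outgoing-set size)^depth.
# This counts only the atoms at the deepest level, so it understates the
# full chunk; the point is just how quickly the payload dwarfs any
# per-message serialization overhead.
ATOM_BYTES = 12  # per-Atom size quoted in this thread

def chunk_bytes(avg_outgoing, depth):
    """Approximate serialized size of one chunk: 12 * n^d bytes."""
    return ATOM_BYTES * avg_outgoing ** depth

for n in (2, 5):            # illustrative average outgoing-set sizes
    for d in (1, 3, 5):     # illustrative chunk depths
        print(f"n={n}, d={d}: {chunk_bytes(n, d)} bytes")
```

Even the modest n=5, d=5 case comes to 37,500 bytes per chunk, which is a perfectly sensible message size to be shipping over something like ZeroMQ.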
