Hi Ben,

On Thu, Apr 22, 2021 at 5:33 PM Ben Goertzel <[email protected]> wrote:
> If... well I can imagine that there are scenarios where having a DAS is
> useful... but we don't have working code for any of those. The problem of
> building a technology like DAS for some non-existent users who might show
> up in the future... well, you might discover that the DAS is built wrong.
> It might have the wrong performance profile. It might be too clunky. It
> might have lots of features you don't need, and be missing features you
> do. Designing and building things that don't have a current use is a very
> risky business.

> The bio-Atomspace we are experimenting with now contains only a small
> % of the biomedical knowledge we would like it to, which is because of
> RAM and processing speed limitations in current OpenCog
>
> Recent optimizations help but don't remotely come close to solving the
> problem

OK. Well, that's news to me. I try to keep everyone happy, and when there aren't any comments or complaints, I assume everyone is happy. Do you have actual examples where you are running out of RAM, or where things are going too slow? Or is this just a gut-feel issue, for which you have no actual data?

I cannot repeat this often enough or strongly enough: the kinds of optimizations that are performed on software systems are extremely data-dependent and algorithm-dependent. It is effectively impossible to perform optimizations without having a specific use case. This is a kind-of theorem of computer science.

> The neural-symbolic grammar learning that Andres Suarez and I
> prototyped last spring, also couldn't viably be done using OpenCog for
> similar reasons (RAM and processing speed limitations).

No one ever complained about RAM or processing speeds, so it's kind of unfair to just bring this up a year later. I had the impression that the theory you were developing wasn't working out; I wasn't surprised, but I never fully understood it.
This spring, I restarted work on https://github.com/opencog/learn -- you can review the README for the current status. I get good results. It's a big project. Things go slowly. Not enough time in the day.

> If we could complete that work it would be very useful in our current
> humanoid robotics work w/ OpenCog (Awakening.Health)

I would like to help, but I suspect that the direction I've been going in is not likely to match your requirements. A good solution might not arrive on the time-scale you want.

> The experimentation on pattern mining from inference histories for
> automated inference control, that Nil was doing a year ago, was
> incredibly slow also due to Atomspace limitations.

Ben, that is also incredibly unfair. Never-ever did you or Nil or anyone else ever complain about "atomspace limitations". So you can't just start blaming it now. If there is an actual performance problem, open a github issue, and describe it. Provide instrumentation and identify the bottlenecks.

I watched those projects from afar, and ... well, all I can say is "that's not how I would have done it". The fact that you had performance problems is almost surely a statement about your algorithms, and not a statement about the atomspace. The atomspace is what it is, and if you use it incorrectly, you'll get disappointing results. It's not a magic wand. It's just software, like any other kind of software.

> It is possibly true that for each such case, one could design a
> specialized architecture to support just that case, working around the
> need for a general-purpose DAS in that particular case....

You are describing things that sound (to me) like inadequate or inappropriate algorithms, and then switching the topic to DAS. You didn't have to use the atomspace -- you could have done the inference mining on any one of a half-dozen map-reduce platforms out there -- many of them from the Apache.org people -- and you would not have gotten performance that is any better than what the atomspace provides.
These complaints are reminiscent of decades' worth of paper magazine articles (remember those?), blog posts and marketing campaigns about big data, scale-up vs. scale-out, data mining, machine learning, no-sql vs. sql vs. graph databases. There must be ten thousand white-papers and a hundred thousand pages on this stuff.

Nothing that I saw Shujing doing with pattern mining was any different from what anyone else in the industry does when they data-mine. This is common, every-day stuff. Everyone in the industry experiences variations thereon. If it didn't work out for you, maybe you didn't have your nose to the grindstone.

The advantage that google enjoys (besides having more money) is that they can pair PhDs who understand the problem with engineers who understand how to make the code run fast. Either way, those salaries are huge, and they never have enough grunts to work on the visionary projects; serendipity plays a big role, as does good management. You can get lucky some of the time; you can't get lucky all of the time.

> I understand we could proceed by writing
> fully-working-except-for-scalability-issues code for all the above
> applications I've alluded to (and more), and then analyzing all this
> code and its specific scalability issues, and using this analysis to
> drive design of an improved system...

I don't think that is what I'm saying. That's not really how it's done in industry. You could try that, but it's likely to fail. Why? For starters, I'm not convinced that the algorithms are correct. If you have a combinatoric explosion (which I am guessing is what is happening) then you have to address that first. You have several options:

(a) you can try to mitigate it
(b) you can hunt for alternative algorithms and data structures
(c) you can redesign it for specialty hardware (e.g. GPUs)
(d) ...

No one ever builds "working except for scalability" code, and then "just scales it". That is not how it's done.
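To make "combinatoric explosion" concrete: a query with k variables and n candidate groundings each has n**k naive combinations, and pruning partial groundings early, i.e. option (b) above, swapping in a better algorithm, can collapse that. The sketch below is a toy with a hypothetical compatibility constraint and made-up candidate sets; it is not atomspace code.

```python
from itertools import product

# Toy query: four variables, 30 candidate groundings each.
# Naive search space: 30**4 = 810,000 combinations.
candidates = {f"$x{i}": list(range(30)) for i in range(4)}

# Hypothetical join constraint: adjacent variables must be consecutive.
def compatible(a, b):
    return b == a + 1

# Naive approach: enumerate the full Cartesian product, then filter.
def naive_count():
    names = list(candidates)
    return sum(
        all(compatible(g[i], g[i + 1]) for i in range(len(g) - 1))
        for g in product(*(candidates[n] for n in names))
    )

# Pruned approach: extend partial groundings only while the constraint
# still holds, cutting off incompatible branches early.
def pruned_count():
    names = list(candidates)
    partial = [(v,) for v in candidates[names[0]]]
    for name in names[1:]:
        partial = [g + (v,) for g in partial
                   for v in candidates[name] if compatible(g[-1], v)]
    return len(partial)
```

Both functions return the same answer, but the pruned version only ever materializes a few dozen partial groundings instead of walking 810,000 tuples; that difference is the whole game when the explosion is real.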
> Instead we are indeed aiming to proceed in a faster but in some senses
> more risky way, by creating a design that appears to us capable of
> scalably carrying out applications such as the above

Given that I don't understand the "applications such as those above", I don't know how to respond. You would have to describe those applications in engineering terms in order to understand how they could be implemented so as to run efficiently and scalably. Without an actual description of what it is, it's not a solvable problem. There's just an insufficient amount of detail.

> >> -- a piece of the distributed persistent Atomspace, say in RocksDB
> >>
> >> Then of course the AI process on machine X can query the local
> >> Atomspace specifically as it wishes. If it wants to query the
> >> persistent backing store, it can query RocksDB and it will get faster
> >> response if the answer happens to be stored in the fragment of RocksDB
> >> that is living on the same machine as it. There will be some bias
> >> toward the portion of the distributed RocksDB on machine X having
> >> Atoms that relate to the Atoms in the local Atomspace on machine X ...
> >> but this depends on the inner workings of RocksDB. Or at least
> >> that's how I'd think it would work, this is speculative...
> >
> > This is not quite how it works, but, roughly speaking, I've got
> > prototypes that do this today.
>
> What are the main differences btw what I described above and what your
> prototypes do?

Rocks does not do sharding across the network. If you have different fragments of an atomspace dataset on 10 different networked machines, and you want to write a pattern match that will run across all of those machines in parallel, and join together the results, I could write that snippet of code in the proverbial afternoon. It's so simple, in fact, that it could be written as an example, to add to the set of examples. (Actually, I think one of the examples already does this, more or less.
Actually, it's a mashup of these two demos: https://github.com/opencog/atomspace/blob/master/examples/atomspace/persist-multi.scm and https://github.com/opencog/atomspace/blob/master/examples/atomspace/persist-query.scm )

That's not the hard part. The hard part is a ladder of requirements:

* How do you get the shards of data onto those machines? Do you use rsync to copy files, or do you want to send them via atomspace? If you use rsync, then where will you keep the script for it?
* Where do you keep the list of the currently-active set of 10 machines? Do you need a GUI for that? A phone app?
* What do you do if one or more of them hasn't booted, or has crashed?
* Are they password-protected? The atomspace is not password-protected!

Then there are atomspace issues:

* The simplest solution is to wait until all ten have returned results, and then join them together.
* Another possibility is to let the results dribble in, and join them as they arrive. This is more complex, and requires more sophistication. The 10-line demo program now becomes a 100- or 200-line program.
* What if one of the machines has crashed during processing? e.g. a bad network card, a failed disk, a power outage?
* Perhaps you want to load-balance, so that the slowest machine is not always the bottleneck. This requires measuring each machine to see if it is idle, and giving it more work if it is. This is non-trivial. Most engineers would do this outside of the atomspace, but you could also do it inside the atomspace if you write custom Atoms for it. Does your design require custom Atoms for load-balancing?
* Perhaps the dataset is badly sharded, so that one of the machines is always a bottleneck. This requires not only finding the busiest machine, but then re-sharding the data. Many databases do this automatically. The conventional way in which this is done is to find a sequence of "least cruel cuts" in the Tononi sense, and move those to other machines. Find the cuts that hurt Phi the least.
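The two join strategies in the list above, wait-for-all versus letting results dribble in, can be sketched in a few lines of Python. This is a minimal sketch, not atomspace code: `query_shard`, the host names, and the result format are hypothetical stand-ins for a real per-machine pattern match, which would actually go out over the network (e.g. through a network-attached StorageNode).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for running one pattern match on one remote
# shard; a real version would connect to the machine and query there.
def query_shard(host: str) -> list:
    return [f"result-from-{host}"]

HOSTS = [f"machine-{i}" for i in range(10)]

# Variant 1: the simplest solution -- wait until all ten shards have
# returned results, then join them together.
def query_all_wait(hosts):
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        result_lists = pool.map(query_shard, hosts)
    joined = []
    for results in result_lists:
        joined.extend(results)
    return joined

# Variant 2: let the results dribble in, joining them as they arrive,
# so the caller can start consuming before the slowest machine finishes.
def query_all_streaming(hosts):
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = [pool.submit(query_shard, h) for h in hosts]
        for future in as_completed(futures):
            yield from future.result()
```

Even this toy shows where the extra complexity creeps in: the streaming variant is where you would bolt on per-shard timeouts, retries for crashed machines, and load-balancing, which is exactly why the 10-line demo grows into a 100-line program.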
Talking about Phi is fancy-pants buzzword-slinging, but all the people who do data-mining have a very intuitive understanding of Tononi's Phi, and have had that understanding for many decades, because it's key to both software and hardware optimization. This is easy to say, but finding those cuts is hard to do. Nothing in opencog today does this automatically. However, I can imagine several possible solutions, ranging from really easy ones to really complex ones, each having pros and cons. Vendors like Oracle have had solutions for this for decades. They've invested hundreds of man-years into it.

* There's more. I wanted to mention concepts like "explain vacuum analyze" and "query planning", but perhaps some other day. Everyone gets to solve the query-planning problem, including Hyperon. There's no free lunch.

Then there are the data-design issues and meta-issues:

* Perhaps you are storing data as Atoms that should have been Values. Values are a lot faster than Atoms, but they get this performance with a set of function trade-offs.
* Perhaps your data should not be kept in the atomspace at all. This includes audio, video live-streams, text files, medical records, and a zillion other data types.
* Perhaps you want to run SciPy on text summarizers, or write tensorflow algos. There are 1001 software platforms that are tuned for stuff like that. Use the tool that is appropriate.

> Your help on the bio-AI project is much appreciated! However, I would
> note that Mike Duncan doesn't have the facility w/ tuning and fixing
> Atomspace/OpenCog code

Mike is the one who asked for help. I worked with him as the manager in charge of the project, and not as the technical contact. 90% of the code was written by Habush, and he is the one I interacted with regularly, and a little bit with Hedra Seid. I like Habush. He enjoys exploring brand-new shiny things.

> that Vitaly and some of his St. Petersburg
> colleagues do....
Again, this code is not a part of the atomspace; it's not even a part of opencog. Vitaly & friends have never looked at it, never touched it.

> I would venture there are less likely to be
> relatively quick fixes to apparent brick walls that Vitaly etc. run up
> against. But I'd be happy to be refuted by reality on this -- we will
> be using Original OpenCog in Awakening.Health for quite some time and
> so having it work better and better is definitely valuable to us...

None of the agi-bio code is a part of opencog. That is not the code that got fixed!

> Hmm, at a high level we did guess a pattern cache was going to be
> useful -- and Senna implemented one some time ago.

The concept of a cache is about as generic as the concept of a loop or an if-statement. You are effectively saying that Senna thought of using if-statements and loops, and implemented a program that used them. That's just crazy-making!

It's a shame I have to end the email like this, but this is the kind of mis-communication that we habitually engage in. I find it difficult, at times. The alternative is to ignore these kinds of comments, but that's not terribly productive, either.

-- Linas

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA36_m4U0c1fveXbpjLC8tzcuKpYeqO1PF8R4qHAMifHMZQ%40mail.gmail.com.
