On Mon, May 28, 2018 at 8:27 AM, Alexey Potapov <[email protected]> wrote:
> Hi.
> Here are some additional thoughts on OpenCog+PPL, which I didn't include
> in the first message
>
> *OpenCog + PPL*
>
> One of the ideas/tasks was to make an OpenCog a "better Church".

Oooof. I wrote a long diatribe about religion, until I realized you must be referring to something that Alonzo Church must have written or invented. I am not that familiar with his work, so I don't know what a "better Church" would be. Is there some specific paper or book?

> *1. Why is it possible*
>
> Universal probabilistic programming languages (PPLs) utilize
> sampling-based approaches to infer posterior probabilities. They do not
> "reason" about tasks.
>
> Consider a number of examples.
>
> 1.1)
>
> (rejection-query
>   (define n (random-integer 1000000))
>   n
>   (= n 10))
>
> This program says that *n* is a random integer number from 0 to 999999,
> and it tries to estimate P(*n*|*n*=10). It will take quite long, although
> it is obvious that P(*n*=10|*n*=10)=1. We can imagine an (extended)
> Pattern Matcher easily deducing *n*=10 and checking that it fits
> [0, 999999].

Or you could apply analytic combinatorics to obtain this result. Philippe Flajolet has a 600 or 800 page-long book, showing how to do this. It's called "Analytic Combinatorics".

> 1.2)
>
> (rejection-query
>   (define n (random-integer 10))
>   (define m (+ n 5))
>   n
>   (= m 0))
>
> Here, rejection-query will hang forever, since the condition cannot be
> met. Again, the absence of the answer can be easily deduced (given some
> properties of +).

Yep.

> 1.3) Einstein's (zebra) puzzle can be easily represented in Church:
>
> (define where list-index)
> (define what list-ref)
> (define (is? xs x ys y)
>   (eq? (what ys (where xs x)) y))
> (define (neighbour? x y)
>   (or (= x (+ y 1))
>       (= x (- y 1))))
>
> (rejection-query
>   (define colors (shuffle '(Yellow Blue Red Ivory Green)))
>   (define nationalities (shuffle '(Norwegian Ukrainian Englishman Spaniard Japanese)))
>   (define drinks (shuffle '(Water Tea Milk OrangeJuice Coffee)))
>   (define smokes (shuffle '(Kools Chesterfield OldGold LuckyStrike Parliament)))
>   (define pets (shuffle '(Fox Horse Snails Dog Zebra)))
>
>   (what nationalities (where drinks 'Water))
>
>   (and (is? nationalities 'Englishman colors 'Red)
>        (is? nationalities 'Spaniard pets 'Dog)
>        (is? drinks 'Coffee colors 'Green)
>        (is? nationalities 'Ukrainian drinks 'Tea)
>        (= (where colors 'Green) (+ (where colors 'Ivory) 1))
>        (is? smokes 'OldGold pets 'Snails)
>        (is? smokes 'Kools colors 'Yellow)
>        (= (where drinks 'Milk) 2)
>        (= (where nationalities 'Norwegian) 0)
>        (neighbour? (where smokes 'Chesterfield) (where pets 'Fox))
>        (neighbour? (where smokes 'Kools) (where pets 'Horse))
>        (is? smokes 'LuckyStrike drinks 'OrangeJuice)
>        (is? nationalities 'Japanese smokes 'Parliament)
>        (neighbour? (where nationalities 'Norwegian) (where colors 'Blue))))
>
> Again, blind search performed by PPLs is too inefficient here.

So why bring it up?

> OpenCog can solve this problem quite efficiently,

Well, that depends on how the problem is coded, representationally. If it is coded as a graph search, then there is a combinatoric explosion -- there are 5^5=3125 branches, and it takes a long time to explore each branch. Due to representational difficulties in Atomese, I think the representation used in the unit test has far more branches, using more variables than might otherwise be needed. (E.g. for neighbors, it uses unordered links, which have 5! permutations. So there are some factors of 5! in that unit test - which is due to an over-simplified representation.)
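As a concrete aside: Alexey's example 1.1 can be mimicked in plain Python to see the inefficiency directly. This is only an illustrative sketch; the function name `rejection_query` and the try-count guard are my own, not part of Church, and the domain is scaled down from 1000000 to 100000 so the demo runs quickly.

```python
import random

def rejection_query(sample, condition, max_tries=10_000_000):
    """Naive rejection sampling: draw from the prior until the
    condition holds.  The max_tries guard is what keeps an
    unsatisfiable condition (example 1.2) from hanging forever."""
    for tries in range(1, max_tries + 1):
        value = sample()
        if condition(value):
            return value, tries
    raise RuntimeError("condition never satisfied")

# Example 1.1 (domain scaled down for speed): n uniform, conditioned
# on n = 10.  The sampler needs ~100000 draws on average to "discover"
# what a pattern matcher could deduce in a single step: n = 10.
random.seed(42)
n, tries = rejection_query(lambda: random.randrange(100_000),
                           lambda v: v == 10)
print(n, tries)
```

With the guard removed, the unsatisfiable condition of example 1.2 makes this loop spin forever, exactly as rejection-query does in Church.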
The classic, efficient solution to this problem is to use a 5^5 tensor, and knock out the tensor rows and columns. That is efficient, but it is not obvious how to represent this, how to code this up in Atomese. I'm not saying it's impossible, only that it's not obvious: for a human, it remains a difficult task to write the code that solves this efficiently. Constraint-satisfaction programming languages do make it easy to specify these kinds of problems: viz, it is probably quite easy to write down this problem using ASP. However, the current OpenCog rule engine does not use the ASP syntax (or any other constraint-satisfaction syntax that I know of), and thus is not an efficient constraint-satisfaction solver.

> and we can imagine that an equivalent probabilistic program is written in
> Atomese and URE deductively infers the answer starting from constraints
> without actually sampling random variables.
>
> We can see that from the PPL perspective, OpenCog can be used to make
> inference over probabilistic programs much more efficient (at least, in
> some cases). Why not write these programs in Atomese directly? For what
> reason do we need a PPL metaphor?

Well, for several reasons. First, Atomese is a representational language; it is NOT a compute platform. This is important to understand: Atomese is designed to represent things, say things, to provide a syntax that gives meaning to ideas. Atomese is NOT a compute platform. (OK, so, de facto, it can compute, because it does have an existing implementation; this implementation is... inadequate in many ways.)

The goal of representation is so that syntactic transformations can be applied to the network-structures. That is, if one has a structure of some particular shape, then it can be transformed to another structure, in any one of several different ways. In Wikipedia, these structural transformations are called "rewrite rules". In model theory, these structural transformations are called "a theory". In proof theory, these structural transformations are called "rules of inference". In gcc (the compiler) these structural transformations are called "gimple". LLVM and the JVM and C# have yet other names for them. All of these different names really describe the same general concept: the syntactic (homotopic) transformations between shapes (thus, homotopy type theory).

Atomese is a device for representing shapes (and thus, implicitly, their semantic content), and the syntactic transformations that can be applied to the shapes. It has been convenient to also perform computations "under the covers", sort-of hidden away, in C++ classes. So, in this sense, Atomese is "accidentally" a computational platform. However, it is terribly inefficient. If you compare Atomese to, say, the Java bytecode, the C# bytecode, or the guile bytecode, it is literally thousands, tens-of-thousands of times slower. (But not a million; it's not that slow.) The point here is that Atomese is a representation. What is done under the covers is the implementation, and that is quite a different thing. The current implementation is what it is. It can be improved, in a homotopic, continuous fashion, but that will take some hard work by systems programmers.

The URE is a different beast. The original intent was that, since our data consists of graphical structures, and we have a basic single-step graph-rewriting primitive (the pattern matcher), we should now build a system that can apply multiple syntactic transformations to the knowledge graph. From this comes the concept of "chaining": one can apply one transformation after another after another, in a linear sequence or "chain". This then splits into forward and backward chaining. And this idea got implemented and is called the URE. It was a worthwhile experiment, but, in the end, perhaps too simplistic and naive in conception. There are better ways.
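To keep the terminology grounded, here is a toy forward chainer in Python. This is my own illustration, not the URE's actual design; the real thing is far more general, since its rules bind variables rather than matching whole facts.

```python
def forward_chain(facts, rules):
    """Repeatedly apply rules of the form (premises -> conclusion)
    until no new facts can be derived (a fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Toy knowledge base (fact names are mine, purely for illustration).
rules = [
    (("rain",), "wet-grass"),
    (("sprinkler",), "wet-grass"),
    (("wet-grass", "cold"), "icy"),
]
derived = forward_chain({"rain", "cold"}, rules)
print(sorted(derived))   # ['cold', 'icy', 'rain', 'wet-grass']
```

Each pass scans every rule against every fact; with long chains and rules that bind variables, the candidate rule-sequences multiply combinatorially, which is the inefficiency at issue.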
By the 1980's, academia knew that chaining is fundamentally inefficient, and that there are several ways to do better. One way is to approach the problem as a "parsing problem": instead of chaining, one asks, how can one select a set of rules, and assemble them (as "puzzle pieces") in such a way as to provide an extended transformation? Once you start thinking of it that way, you can ask, "what are the efficient parse algorithms?" The second approach to this problem is to realize that parsing is a constraint-satisfaction problem, and so one asks, "what are the efficient constraint-satisfaction solvers?" Continuing in that direction leads down the road towards SAT solvers.

Thus, in the future, I would like to see a URE-II which assembles rules as if it were solving a parsing or constraint-satisfaction problem. This would be much, much more efficient. The syntax of the rules would have to change; a side effect of working with the URE has been the discovery that "variables are harmful" -- we use variables for connectors, and this requires the concept of quoting, which turns out to be a major mistake. Live and learn. Whatever.

OK, so now on to your question: "what do we need probabilistic programming languages for?" To answer this, I first have to explain where probability comes in. The graphs, above, and the syntactic transformation rules being applied to them: they do NOT come uniformly distributed. Even if they did, after a few short turns of the crank, the distribution would no longer be uniform. There are two ways to see this: pattern generation, and pattern recognition. In pattern generation, one applies a sequence of rules, and obtains fractals -- clearly, very highly non-uniform in distribution. More abstractly, there might be dozens or hundreds of different rule-applications that all lead to the same end-point (and thus have a large measure or "probability" assigned to that endpoint). Other endpoints may be very "thin", with only one, single rule-sequence leading there (and thus that endpoint has a small measure or "probability").

In pattern recognition (e.g. language recognition by finite automata or stack automata, as a basic example; or natural-language parsing, as a more sophisticated example), some rules are used more frequently than others, and thus "should be tried first". In natural language, where there are ambiguities, there is a correlation between the "correct parse", or, at least, the "most likely intended parse", and the rules used to obtain it. There is a likelihood that can be assigned to rules and rule combinations.

So, in light of this, what should one do? There are two answers. One is the answer that Ben proposed some decade ago or more, with "probabilistic logic networks": assign probabilities to knowledge-representations, assign probabilities to transformational rules, and use certain, specific formulas (the "PLN formulas") to attempt to track or emulate or approximate the true flow of probability through the system. There are several problems with this. It is not prima facie obvious that the PLN formulas are correct. I guess they offer some reasonable approximation of probability flow, but it seems to be invented from thin air; a decent rule-of-thumb, but that's all. A second problem is that there is a combinatorial explosion during rule application -- this is rooted in the combinatorial explosion in rule chaining (backward or forward). If one instead replaced the chainers by constraint solvers, how would one, could one apply the PLN formulas? It's not clear.

So, the idea is that probabilistic sampling, via probabilistic programming techniques, offers a different mechanism for obtaining the probabilities. This is best illustrated by example.
Suppose, in the Einstein problem, that instead, we had that one person had a French mother and a Dutch father, and thus was half-French, half-Dutch. Suppose the tea-drinker also drank coffee one morning a week. Suppose the Kools smoker preferred Kools, unless he ran out, and then smoked OldGold. And so on. At this point, the puzzle will no longer have a unique solution, but will have several solutions: some with high probability, and thus likely; others merely "plausible": consistent with the facts, but unlikely. There are also still a very large number of impossible assignments.

NOW how does one try to solve the puzzle? One answer is "use PLN". Another answer is "use probabilistic programming". Here, it seems possible that random sampling will be faster than the recursive application of rules and probability propagation via rule-formulas. It also seems that it will be more accurate. Again: the PLN formulas are ad hoc: they were not derived in a rigorous way (e.g. via "analytic combinatorics", which gives exact results). Applying analytic combinatorics to the (probabilistic) Einstein problem is difficult. Thus, frequentist sampling, using probabilistic programming, will probably be not only faster, but also far more accurate.

There is still one fairly large problem with probabilistic programming; this goes back to the "Medieval reasoning" examples I gave earlier (which I wrote up in a blog entry: https://blog.opencog.org/2018/05/27/medieval-thinking-and-political-agi/ ). Probabilistic programming is effectively a sampling technique; it does not reveal cause and effect. That is, it will generate plausible answers (sure, the Kools smoker might be one-fourth-Greek, which is compatible with orange-juice drinking), and even the likeliest answer (the one with the highest weight assignments). But it does not provide a "model of the world" or a "case for the prosecution", the way that a (Medieval Scholastic... or modern) legal system might. That is, in a court of law, the statement that "the Kools smoker might drink orange juice" is tantamount to saying "the accused had time to drive there, kill the victim and drive back": it is not a probability assignment, but rather a ruling-out of counter-factuals. I do not know how to use probabilistic programming to rule out counter-factuals. Just because, out of 1000 samples, some branch was not taken, that does not mean that the branch will not be taken on the 1001st sample.

That is, probabilistic programming provides a way of estimating probabilities. For many cases, it may be faster, and provide more accurate answers, than using PLN and the URE rule chainer. However, it does not provide any mechanism for counter-factual reasoning. It does not provide a mechanism to build competing "models of the world" (viz, a model where the accused is innocent, another where the accused is guilty).

Model of the world: to be clear: the Einstein puzzle allows one to build court cases, and make accusations: "the prosecution asserts that the Norwegian drinks juice!" -- which may be true or false. But where did the Einstein puzzle come from? How do we even know that we should be solving the Einstein puzzle, and not some other puzzle? Why are we talking about Norwegian juice drinkers instead of school shooters and #metoo victims? Building world models requires pattern discovery.

--------

I don't know if you followed all of that, or if you found it to be too abstract or too hand-wavey. There is a very specific, narrowly-focused place where I think that probabilistic programming could be immediately and directly useful, and for which it might be mostly straight-forward to write the code: a sampling replacement for PLN. The algo goes like this. For each input fact, assign an a-priori probability. Next, sample: assign each input fact a crisp truth value of T or F (according to the probability) and perform crisp-logic reasoning, propagating these crisp truth values. Arrive at the conclusion(s), and count. This gives a frequentist sampling result for the final probability.

Several interesting things happen. (1) It might be faster to do 1000 crisp-logic samples than to use PLN. (2) It validates the PLN formulas (or not) -- the frequentist answers should be assumed to be "correct" -- when does PLN agree with the frequentist results, and when does PLN diverge? If it diverges, then why?

I think that this is a sufficiently concrete, down-to-earth, practical problem that it can be tackled in some reasonable time-frame. It does not require any of the loosey-goosey abstract nonsense I was talking about before.

> *2. What is missing*
>
> Consider the following example.
>
> 2.1)
>
> (rejection-query
>   (define x (gaussian 0 1))
>   (define y (gaussian 0 1))
>   x
>   (= (+ x y) 1))
>
> It will actually not work in Church, because the strict condition will
> not be satisfied (but it can be modified with the use of soft equality).
> URE (given necessary axioms) can easily infer *y*=1–*x*. However, it will
> not be able to ground variables. Actually, these are not variables to be
> grounded using number nodes, but rather values should be assigned to
> them. Also, we shouldn't just sample *x* and then calculate *y*=1–*x*,
> because different values of *y* have different prior probability. Thus,
> if we want to estimate posterior probabilities, we should add specific
> mechanisms for taking prior probabilities of inferred values into
> account. Of course, we will also need some basic random distributions.

Again, the URE is not very efficient. Also, the "rules" would have to be obtained via analytic combinatorics, and this is not easy. If it were easy, there would not be a 600-page book on it. It is not practical for humans to spend time trying to hand-write rules, anyway -- the whole point is to automate the learning of rules, from nothing, rather than to have humans write the rules.
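A concrete aside on example 2.1: the "specific mechanism for taking prior probabilities of inferred values into account" can be plain importance weighting. Once y = 1 - x has been deduced, each sampled x is weighted by the prior density of the inferred y. A sketch in Python; all names here are mine, purely for illustration:

```python
import math
import random

def normal_pdf(v, mu=0.0, sigma=1.0):
    """Density of the N(mu, sigma^2) prior."""
    return (math.exp(-((v - mu) ** 2) / (2.0 * sigma ** 2))
            / (sigma * math.sqrt(2.0 * math.pi)))

# x, y ~ N(0,1), conditioned on x + y = 1.  Exact-equality rejection
# sampling never accepts a sample; instead, deduce y = 1 - x and
# weight each draw of x by the prior density of that inferred y.
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200_000)]
weights = [normal_pdf(1.0 - x) for x in xs]
posterior_mean = (sum(w * x for w, x in zip(weights, xs))
                  / sum(weights))
print(posterior_mean)   # analytically, E[x | x + y = 1] = 0.5
```

Just sampling x and setting y = 1 - x without the weight would estimate E[x] = 0, the wrong answer; the weighting is exactly Alexey's point.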
But the learning of rules is another, different abstract discussion.

> 2.2) Fitting a polynomial with an unknown degree.
>
> (define (calc-poly x ws)
>   (if (null? ws) 0
>       (+ (car ws) (* x (calc-poly x (cdr ws))))))
>
> (define (calc-poly-noise x ws sigma)
>   (+ (calc-poly x ws) (gaussian 0 sigma)))
>
> (define (generate xs ws sigma)
>   (map (lambda (x) (calc-poly-noise x ws sigma)) xs))
>
> (define (sum-dif2 xs ys)
>   (if (null? xs) 0
>       (+ (* (- (car xs) (car ys)) (- (car xs) (car ys)))
>          (sum-dif2 (cdr xs) (cdr ys)))))
>
> (define (samples xs ys)
>   (mh-query 10 10
>     (define degree (sample-integer 4))
>     (define ws (repeat (+ 1 degree) (lambda () (gaussian 0 3))))
>     (define sigma (gamma 1 2))
>     degree
>     (< (sum-dif2 (generate xs ws sigma) ys) 0.5)))
>
> (define xs '(0 1 2 3))
> (define ys (generate xs '(0.1 1 2) 0.01))
> (hist (samples xs ys) "degree")
>
> This program actually works and finds the correct degree of the
> polynomial using the Bayesian Occam razor that is available "for free"
> in PPLs.

I did not take the effort to try to decipher the program up above. Our traditional answer for these kinds of problems is "use MOSES". I did spend some time working on a framework that would allow MOSES to solve differential equations (that is, given a (noisy) time series of data, automatically discover the differential equation that describes that motion). It seems doable, and even reasonably efficient. (Viz, Newton differences; "umbral calculus" effectively means that polynomials and differential equations are "the same thing".) For some things (such as discovering differential equations in motion data) it seems semi-reasonable to have human beings write some special-purpose, custom software. But in general, this runs counter to the automatic discovery of patterns. The above example is a large, complex program that is hard to write, hard to understand.
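For what it's worth, the Bayesian Occam razor in example 2.2 need not be mysterious: with a Gaussian prior on the weights and Gaussian noise, the evidence p(data | degree) has a closed form, and maximizing it picks out the true degree. A hedged Python sketch (my own simplification of Alexey's program: the noise sigma is fixed at 0.1 instead of drawn from a gamma prior, and the evidence is maximized instead of running mh-query):

```python
import math

def calc_poly(x, ws):
    """ws[0] + ws[1]*x + ws[2]*x^2 + ... via Horner's rule."""
    acc = 0.0
    for w in reversed(ws):
        acc = acc * x + w
    return acc

def solve_and_logdet(C, y):
    """Gauss-Jordan elimination on a small SPD matrix C: returns
    (C^-1 y, log det C).  No pivoting is needed for SPD matrices."""
    n = len(y)
    A = [row[:] + [y[i]] for i, row in enumerate(C)]
    logdet = 0.0
    for k in range(n):
        piv = A[k][k]
        logdet += math.log(piv)
        A[k] = [a / piv for a in A[k]]
        for i in range(n):
            if i != k:
                f = A[i][k]
                A[i] = [a - f * b for a, b in zip(A[i], A[k])]
    return [A[i][n] for i in range(n)], logdet

def log_evidence(xs, ys, degree, sigma=0.1, tau=3.0):
    """log p(ys | degree) when weights ~ N(0, tau^2) and noise is
    N(0, sigma^2): marginally, ys ~ N(0, sigma^2 I + tau^2 Phi Phi^T)."""
    n = len(xs)
    phi = [[x ** j for j in range(degree + 1)] for x in xs]
    C = [[(sigma ** 2 if i == k else 0.0)
          + tau ** 2 * sum(a * b for a, b in zip(phi[i], phi[k]))
          for k in range(n)] for i in range(n)]
    Cinv_y, logdet = solve_and_logdet(C, list(ys))
    quad = sum(y * v for y, v in zip(ys, Cinv_y))
    return -0.5 * (quad + logdet + n * math.log(2.0 * math.pi))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [calc_poly(x, [0.1, 1.0, 2.0]) for x in xs]   # true degree is 2
best = max(range(4), key=lambda d: log_evidence(xs, ys, d))
print(best)
```

Degree 3 fits the data just as well as degree 2, but its prior spreads probability over more functions, so the evidence penalizes it; that is the Occam razor the mh-query program gets "for free".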
We cannot require humans to write large, complex programs every time some new problem comes up. Pattern discovery needs to be automated.

> Here, Atomese appears to be rather useless (constraints cannot be
> deductively propagated/pattern-matched, and there is no set of atoms to
> which variables can be grounded).

Well, be careful in understanding what "Atomese" is. Again:

-- Atomese is a knowledge representation system. In so far as the above is some snippet of knowledge, it can be represented in Atomese.

-- If you are thinking of the pattern matcher, and of using forward-chaining to solve the problem of fitting a polynomial to a sequence of data points, I think this is doable; you'd have to create a set of rules defining least-squares fitting. One of the example demo programs in the examples directory shows how to create a finite probabilistic state machine in Atomese. Another demo extends this into a mini-quasi-crippled hidden-Markov-model type system. In my imagination, least-squares polynomial fitting could be done the same way, too. It seems like a reasonable "homework problem" for someone. But again, having humans write least-squares fitting programs in Atomese is kind-of opposite to the general direction I want to go in. It *is* useful to try to solve different kinds, different classes of problems in Atomese, such as polynomial fitting, or Bayesian networks, or whatever -- these are all worthwhile experiments -- it highlights, indicates where the hard parts are, where the easy parts are, what the shortcomings are, what is good and what is bad, what needs to be rethought, re-designed.

If you are thinking of the URE -- then yes, in retrospect, the concept of Variables is clearly a mistake. It was a reasonable mistake to make, but still a mistake. It is an example of an experiment in Atomese. We learned something from it.

> Small modifications of the Pattern Matcher might be enough (e.g.
> introducing a RandomVariableNode and sampling its value from a given
> distribution instead of systematically enumerating all possible
> groundings),

We already have that; except the name is different. Two of them, actually: RandomNumberLink, and RandomChoiceLink

https://wiki.opencog.org/w/RandomNumberLink
https://wiki.opencog.org/w/RandomChoiceLink

These were originally implemented for doing a probabilistic-programming approach to controlling the Hanson Robotics Sophia robot. We do NOT have a CountVariableNode that could accumulate counts of how often some particular event/branch/clause was explored. Currently, we accumulate frequentist counts in two major OpenCog subsystems: one is in the Pattern Miner, the other is in the natural-language-learning subsystem. Both accumulate the counts in an ad-hoc way, outside of Atomese. It might be (should be?!!) interesting to accumulate counts inside of Atomese proper, in a principled way. However, this is easier said than done. For example, in language learning, first we observe pairs of words, and increment counts -- not done in Atomese, but it could be, and it's not that hard. But the second step is to perform an "MST parse", create disjuncts, and accumulate statistics on those. Neither the MST parse nor the disjunct extraction is written in Atomese... nor, at this stage, would I want them to be. Experimenting with count accumulation in Atomese would be an interesting thing to play with, I suppose. I can even offer some prototype suggestions -- the natural place to store counts is in Values. We would need something like

    IncrementCountLink
        SomeAtom
        PredicateNode "some key"

which adds 1.0 to the FloatValue attached to SomeAtom at "some key".

-------------

In the end, I am not at all clear on what you and Ben have agreed on w.r.t. probabilistic programming. Personally, myself, I am very interested in having a probabilistic-sampling rule-engine.
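To make the probabilistic-sampling rule-engine idea concrete: the crisp-sampling scheme described earlier in this mail (assign each input fact a crisp T/F according to its prior, reason crisply, count the conclusions) fits in a few lines of Python. The fact names and the trivial reasoner here are mine, purely for illustration:

```python
import random

def crisp_sample_probability(fact_priors, reason, query, n=50_000):
    """Frequentist alternative to the PLN formulas: sample a crisp
    world from the per-fact priors, run crisp-logic reasoning over
    it, and count how often the query comes out true."""
    hits = 0
    for _ in range(n):
        world = {f for f, p in fact_priors.items()
                 if random.random() < p}
        if query in reason(world):
            hits += 1
    return hits / n

def reason(facts):
    """Toy crisp reasoner: 'wet' follows from rain or sprinkler."""
    derived = set(facts)
    if "rain" in derived or "sprinkler" in derived:
        derived.add("wet")
    return derived

random.seed(7)
p = crisp_sample_probability({"rain": 0.3, "sprinkler": 0.5},
                             reason, "wet")
print(p)   # exact answer: 1 - 0.7 * 0.5 = 0.65
```

The same counts give a direct check on the PLN formulas: run PLN on the same toy network and see where the two diverge.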
The original idea was to convert rules to ASP, then use the Potsdam ASP solver (which is both fast, and has a nice API) to repeatedly run the rules a zillion times, each with different probabilistic samplings, and then accumulate averages. I think that such a probabilistic rule engine could potentially beat PLN at its own game. I'm also interested in replacing the URE by a URE-II that avoids the use of variables, and replaces chaining by constraint satisfaction. Exactly how to make that probabilistic... well, I sketched some hints, above, but it's hard. Currently, I use the link-grammar parser for experiments in probability, but it is a very special-purpose system; it's not general. Separately, I am very interested in various probabilistic pattern-detection problems, but for those, well, this email is too long, and my thinking is cloudier.

--linas

> but there still remain some problems (e.g. with marginalization,
> stochastic recursion, the inefficiency of blind guess in comparison with
> the metaheuristic search, etc.).
>
> -- Alexey

--
cassette tapes - analog TV - film cameras - you

You received this message because you are subscribed to the Google Groups "opencog" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA35ELdsgzB652FD5fmmR5nExU_%3DnLXQw_OM6xWO6eMCXdQ%40mail.gmail.com.
