On Mon, May 28, 2018 at 8:27 AM, Alexey Potapov <[email protected]> wrote:

> Hi.
> Here are some additional thoughts on OpenCog+PPL, which I didn't include
> in the first message
>
> *OpenCog + PPL*
>
>
>
> One of the ideas/tasks was to make OpenCog a “better Church”.
>

Oooof. I wrote a long diatribe about religion, until I realized you must be
referring to something that Alonzo Church must have written or invented.
I am not that familiar with his work, so I don't know what a "better
Church" would be. Is there some specific paper or book?


>
>
> *1.     **Why is it possible*
>
> Universal probabilistic programming languages (PPLs) utilize
> sampling-based approaches to infer posterior probabilities. They do not
> “reason” about tasks.
>
> Consider a number of examples.
>
> 1.1)
>
> (rejection-query
>
>  (define n (random-integer 1000000))
>
>  n
>
>  (= n 10))
>
>
>
> This program says that *n* is a random integer from 0 to 999999,
> and it tries to estimate P(*n*|*n*=10). It will take quite a long time,
> even though it is obvious that P(*n*=10|*n*=10)=1. We can imagine an
> (extended) Pattern Matcher easily deducing *n*=10 and checking that it
> fits [0, 999999].
>

Or you could apply analytic combinatorics to obtain this result. Philippe
Flajolet has a 600-or-800-page book showing how to do this; it's called
"Analytic Combinatorics".
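
The cost of naive rejection sampling here is easy to quantify: each draw
satisfies the condition with probability 1/N, so the expected cost is N draws
per accepted sample. A minimal Python sketch of the idea (the helper names are
illustrative, not Church's API; the domain is scaled down to 100,000 so the
demo runs quickly, but the asymptotics are the same):

```python
import random

def rejection_query(sample, condition, max_tries=2_000_000):
    """Naive rejection sampling: draw from the prior until the condition holds."""
    for tries in range(1, max_tries + 1):
        n = sample()
        if condition(n):
            return n, tries
    raise RuntimeError("condition was never satisfied")

# P(n | n = 10) over n ~ Uniform{0..99999}: each draw succeeds with
# probability 1e-5, so we expect ~100,000 draws per accepted sample.
value, tries = rejection_query(lambda: random.randrange(100_000),
                               lambda n: n == 10)
# A deductive system would simply conclude n = 10 and check that it
# lies in the domain, with no sampling at all.
assert value == 10
```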


>
>
> 1.2)
>
> (rejection-query
>
>  (define n (random-integer 10))
>
>  (define m (+ n 5))
>
>  n
>
>  (= m 0))
>
>
>
> Here, rejection-query will hang forever, since the condition cannot be
> met. Again, the absence of the answer can be easily deduced (given some
> properties of +).
>

Yep.


>
>
> 1.3) Einstein's (zebra) puzzle can be easily represented in Church
>
> (define where list-index)
>
> (define what list-ref)
>
> (define (is? xs x ys y)
>
>   (eq? (what ys (where xs x)) y))
>
> (define (neighbour? x y)
>
>   (or (= x (+ y 1))
>
>       (= x (- y 1))))
>
> (rejection-query
>
>  (define colors (shuffle '(Yellow Blue Red Ivory Green)))
>
>  (define nationalities (shuffle '(Norwegian Ukrainian Englishman Spaniard
> Japanese)))
>
>  (define drinks (shuffle '(Water Tea Milk OrangeJuice Coffee)))
>
>  (define smokes (shuffle '(Kools Chesterfield OldGold LuckyStrike
> Parliament)))
>
>  (define pets (shuffle '(Fox Horse Snails Dog Zebra)))
>
>
>
>  (what nationalities (where drinks 'Water))
>
>
>
>  (and (is? nationalities 'Englishman colors 'Red)
>
>       (is? nationalities 'Spaniard pets 'Dog)
>
>       (is? drinks 'Coffee colors 'Green)
>
>       (is? nationalities 'Ukrainian drinks 'Tea)
>
>       (= (where colors 'Green) (+ (where colors 'Ivory) 1))
>
>       (is? smokes 'OldGold pets 'Snails)
>
>       (is? smokes 'Kools colors 'Yellow)
>
>       (= (where drinks 'Milk) 2)
>
>       (= (where nationalities 'Norwegian) 0)
>
>       (neighbour? (where smokes 'Chesterfield) (where pets 'Fox))
>
>       (neighbour? (where smokes 'Kools) (where pets 'Horse))
>
>       (is? smokes 'LuckyStrike drinks 'OrangeJuice)
>
>       (is? nationalities 'Japanese smokes 'Parliament)
>
>       (neighbour? (where nationalities 'Norwegian) (where colors 'Blue))
>
>       )
>
> )
>
>
>
> Again, blind search performed by PPLs is too inefficient here.
>

So why bring it up?


> OpenCog can solve this problem quite efficiently,
>

Well, that depends on how the problem is coded, representationally. If it
is coded as a graph search, then there is a combinatorial explosion -- there
are 5^5=3125 branches, and it takes a long time to explore each branch.
Due to representational difficulties in Atomese, I think the representation
used in the unit test has far more branches, using more variables than
might otherwise be needed (e.g. for neighbors, it uses unordered links,
which have 5! permutations). So there are some factors of 5! in that unit
test -- which is due to an over-simplified representation.

The classic, efficient solution to this problem is to use a 5^5 tensor, and
knock out the tensor rows and columns. That is efficient, but it is not
obvious how to represent this, how to code it up, in Atomese.  I'm not
saying it's impossible, only that it's not obvious: for a human, it remains
a difficult task to write the code that solves this efficiently.
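
The row/column knockout idea can be sketched on a single attribute of the
puzzle, assuming a boolean possibility matrix (this is only the elimination
step on an invented example, not a full solver, and certainly not Atomese):

```python
# possible[v][h] == True means value v may still occupy house h.
# Unary clues knock out whole rows and columns of the matrix; the
# survivors are then propagated by the remaining constraints.
values = ['Norwegian', 'Ukrainian', 'Englishman', 'Spaniard', 'Japanese']
possible = [[True] * 5 for _ in values]

def assign(v, h):
    """Fix value v to house h by knocking out its row and column."""
    for j in range(5):
        possible[v][j] = (j == h)
    for i in range(5):
        possible[i][h] = (i == v)

assign(values.index('Norwegian'), 0)  # clue: the Norwegian is in house 0

# After the knockout, exactly one candidate survives in row 0 and in
# column 0; no enumeration of 5! orderings was needed for this clue.
assert sum(row[0] for row in possible) == 1
assert sum(possible[0]) == 1
```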

Constraint-satisfaction programming languages do make it easy to specify
these kinds of problems: viz., it is probably quite easy to write down this
problem using ASP.  However, the current OpenCog rule engine does not use
the ASP syntax (or any other constraint-satisfaction syntax that I know
of), and thus is not an efficient constraint-satisfaction solver.


> and we can imagine that an equivalent probabilistic program is written in
> Atomese and URE deductively infers the answer starting from constraints
> without actually sampling random variables.
>
>
>
> We can see that from PPLs perspective, OpenCog can be used to make
> inference over probabilistic programs much more efficient (at least, in
> some cases). Why not write these programs in Atomese directly? For what
> reason do we need a PPL metaphor?
>

Well, for several reasons. First, Atomese is a representational language,
it is NOT a compute platform.  This is important to understand: Atomese is
designed to represent things, say things, to provide a syntax that gives
meaning to ideas.  Atomese is NOT a compute platform (OK, so, de facto, it
can compute, because it does have an existing implementation; this
implementation is... inadequate in many ways).

The goal of representation is so that syntactic transformations can be
applied to the network-structures.  That is, if one has a structure of some
particular shape, then it can be transformed to another structure, in any
one of several different ways.  In wikipedia, these structural
transformations are called "rewrite rules". In model theory, these
structural transformations are called "a theory". In proof theory, these
structural transformations are called "rules of inference".   In gcc (the
compiler) these structural transformations are called "gimple". LLVM, the
JVM, and C# have yet other names for them. All of these different names
really describe the same general concept: the syntactic (homotopic)
transformations between shapes. (thus homotopy type theory)  Atomese is a
device for representing shapes, (and thus, implicitly, their semantic
content), and the syntactic transformations that can be applied to the
shapes.

It has been convenient to also perform computations "under the covers",
sort-of hidden away, in C++ classes. So, in this sense, Atomese is
"accidentally" a computational platform.  However, it is terribly
inefficient.  If you compare Atomese to, say, the Java bytecode, the C#
bytecode, or the guile bytecode, it is literally thousands, even
tens-of-thousands of times slower. (But not a million; it's not that slow.)

The point here is that Atomese is a representation.  What is done under the
covers is the implementation, and that is quite a different thing.  The
current implementation is what it is. It can be improved, in a homotopic,
continuous fashion, but that will take some hard work by systems
programmers.

The URE is a different beast.  The original intent was that, since our data
consists of graphical structures, and we have a basic single-step
graph-rewriting primitive (the pattern matcher), we should build a system
that can apply multiple syntactic transformations to the knowledge graph.
From this
comes the concept of "chaining": one can apply one transformation after
another after another, in a linear sequence or "chain". This then splits
into forward and backward chaining.  And this idea got implemented and is
called the URE.
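
In pseudocode terms, that chaining loop is simple: keep firing rules whose
premises are all present until nothing new is derived. The facts and rules
below are invented for illustration; this is not URE syntax:

```python
# A minimal forward chainer over crisp facts.  Each rule is a pair
# (set-of-premises, conclusion); rules fire whenever all premises are
# present, adding their conclusion, until a fixed point is reached.
rules = [
    ({'man(socrates)'}, 'mortal(socrates)'),
    ({'mortal(socrates)'}, 'dies(socrates)'),
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)     # one link in the "chain"
                changed = True
    return facts

facts = forward_chain({'man(socrates)'}, rules)
assert 'dies(socrates)' in facts
```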

It was a worthwhile experiment, but, in the end, perhaps too simplistic and
naive in conception. There are better ways. By the 1980s, academia knew
that chaining is fundamentally inefficient, and that there are several ways
to do better. One way is to approach the problem as a "parsing problem":
instead of chaining, one asks, how can one select a set of rules, and
assemble them (as "puzzle pieces") in such a way as to provide an extended
transformation?  Once you start thinking of it that way, you can ask, "what
are the efficient parse algorithms?"  The second approach to this problem
is to realize that parsing is a constraint-satisfaction problem, and so one
asks, "what are the efficient constraint-satisfaction solvers?"  Continuing
in that direction leads down the road towards SAT solvers.

Thus, in the future, I would like to see a URE-II that assembles rules as if
it were solving a parsing or constraint-satisfaction problem. This would be
much, much more efficient.  The syntax of the rules would have to change; a
side effect of working with the URE has been the discovery that "variables
are harmful" - we use variables for connectors, and this requires the
concept of quoting, which turns out to be a major mistake.  Live and learn.

Whatever.

OK, so now on to your question "what do we need probabilistic programming
languages for?" To answer this, I first have to explain where probability
comes in.

The graphs, above, and the syntactic transformation rules being applied to
them: they do NOT come uniformly distributed.  Even if they did, after a
few short turns of the crank, the distribution would no longer be uniform.
There are two ways to see this: pattern generation, and pattern
recognition. In pattern generation, one applies a sequence of rules, and
obtains fractals -- clearly, very highly non-uniform in distribution.  More
abstractly, there might be dozens or hundreds of different
rule-applications that all lead to the same end-point (and thus have a
large measure  or "probability" assigned to that endpoint).  Other
endpoints may be very "thin", with only one, single rule-sequence that
leads to that endpoint. (and thus that endpoint has a small measure or
"probability").

In pattern-recognition (e.g. language recognition by finite automata or
stack automata, as a basic example; or in natural-language-parsing as a
more sophisticated example) - in pattern recognition, some rules are used
more frequently than others, and thus "should be tried first".  In natural
language, where there are ambiguities, there is a correlation between the
"correct parse", or, at least, the "most likely intended parse", and the
rules used to obtain it.   There is a likelihood that can be assigned to
rules and rule combinations.

So, in light of this, what should one do?  There are two answers.  One is
the answer that Ben proposed a decade or more ago, with "probabilistic
logic networks": assign probabilities to knowledge-representations,
assign probabilities to transformational rules, and use certain, specific
formulas (the "PLN formulas") to attempt to track or emulate or approximate
the true flow of probability through the system. There are several problems
with this. First, it is not prima facie obvious that the PLN formulas are
correct. I guess they offer some reasonable approximation of probability
flow, but they seem to be invented out of thin air: a decent rule of thumb,
but that's all.  A second problem is that there is a combinatorial explosion
during rule application -- this is rooted in the combinatorial explosion in
rule chaining (backward or forward).  If one instead replaced the chainers
by constraint solvers, how would one apply the PLN formulas? It's not clear.

So, the idea is that probabilistic sampling, via probabilistic programming
techniques, offers a different mechanism for obtaining the probabilities.

This is best illustrated by example.  Suppose that, in the Einstein
problem, one person had a French mother and a Dutch father, and thus was
half-French, half-Dutch.  Suppose the tea-drinker also
drank coffee one morning a week. Suppose the Kools smoker preferred Kools,
unless he ran out, and then smoked OldGold. And so on.  At this point, the
puzzle will no longer have a unique solution, but will have several
solutions, some with high probability, and thus likely; while others are
merely "plausible": consistent with the facts, but unlikely.   There are
also still a very large number of impossible assignments.   NOW how does
one try to solve the puzzle?

One answer is "use PLN". Another answer is "use probabilistic
programming".  Here, it seems possible that random sampling will be faster
than the recursive application of rules and probability propagation via
rule-formulas. It also seems that it will be more accurate.  Again: the PLN
formulas are ad hoc: they were not derived in a rigorous way (e.g. via
"analytic combinatorics", which gives exact results).  Applying analytic
combinatorics to the (probabilistic) Einstein problem is difficult.  Thus,
frequentist sampling, using probabilistic programming will probably be not
only faster, but also far more accurate.

There is still one fairly large problem with probabilistic programming;
this goes back to the "Medieval reasoning" examples I gave earlier (which I
wrote up in a blog entry:
https://blog.opencog.org/2018/05/27/medieval-thinking-and-political-agi/ )
Probabilistic programming is effectively a sampling technique; it does not
reveal cause and effect.  That is, it will generate plausible answers
(sure, the Kools smoker might be one-fourth Greek, which is compatible with
orange-juice drinking), and even the likeliest answer (the one with the
highest weight assignments).  But it does not provide a "model of the world"
or a "case for the prosecution", the way that a (Medieval Scholastic... or
modern) legal system might.    That is, in a court of law, the statement
that "the Kools smoker might drink orange juice" is tantamount to saying
"the accused had time to drive there, kill the victim and drive back": it
is not a probability assignment, but rather a ruling out of
counter-factuals.   I do not know how to use probabilistic programming to
rule out counter-factuals.   Just because, out of 1000 samples, some branch
was not taken, that does not mean that the branch will not be taken on the
1001st sample. That is, probabilistic programming provides a way of
estimating probabilities. For many cases, it may be faster, and provide
more accurate answers, than using PLN and the URE rule chainer. However, it
does not provide any mechanism for counter-factual reasoning. It does not
provide a mechanism to build competing "models of the world" (viz., a model
where the accused is innocent, and another where the accused is guilty).

Model of the world: to be clear: the Einstein puzzle allows one to build
court cases, and make accusations: "the prosecution asserts that the
Norwegian drinks juice!" which may be true or false, but where did the
Einstein puzzle come from?  How do we even know that we should be solving
the Einstein puzzle, and not some other puzzle? Why are we talking about
Norwegian juice drinkers instead of school shooters and #metoo victims?
Building world models requires pattern discovery.

--------
I don't know if you followed all of that, or if you found it to be too
abstract or too hand-wavey.

There is a very specific, narrowly-focused place where I think that
probabilistic programming could be immediately and directly useful, and for
which it might be mostly straight-forward to write the code -- this would
be for a sampling replacement for PLN.   The algo goes like this:

For each input fact, assign an a-priori probability. Next, sample: assign
each input fact a crisp truth value of T or F (according to the
probability) and perform crisp-logic reasoning, propagating these crisp
truth values. Arrive at the conclusion(s), and count.  This gives a
frequentist sampling result for the final probability.
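
A minimal sketch of this scheme, on an invented two-premise rule (the priors
and the rule are illustrative, not taken from PLN):

```python
import random

# Frequentist replacement for PLN on a toy inference:  A and B => C.
# Each trial assigns crisp True/False to the input facts according to
# their prior probabilities, runs crisp logic, and we count how often
# the conclusion comes out True.
random.seed(42)
priors = {'A': 0.9, 'B': 0.8}

def one_trial():
    world = {f: random.random() < p for f, p in priors.items()}
    world['C'] = world['A'] and world['B']    # crisp rule application
    return world['C']

n = 100_000
estimate = sum(one_trial() for _ in range(n)) / n
# Under independence, P(C) = 0.9 * 0.8 = 0.72; the frequentist estimate
# should land close to that, and can then be compared against whatever
# the PLN formulas produce for the same rule.
assert abs(estimate - 0.72) < 0.01
```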

Several interesting things happen.
(1) it might be faster to do 1000 crisp-logic samples than to use PLN
(2) it validates the PLN formulas (or not) -- the frequentist answers should
be assumed to be "correct" -- when does PLN agree with the frequentist
results, and when does PLN diverge? If it diverges, then why?

I think that this is a sufficiently concrete, down-to-earth, practical
problem that it can be tackled in some reasonable time-frame.  It does not
require any of the loosey-goosey abstract nonsense I was talking about
before.



>
>
> *2.     **What is missing*
>
> Consider the following example.
>
> 2.1)
>
> (rejection-query
>
>  (define x (gaussian 0 1))
>
>  (define y (gaussian 0 1))
>
>  x
>
>  (= (+ x y) 1))
>
>
>
> It will actually not work in Church, because the strict condition will
> not be satisfied (but it can be modified with the use of soft equality).
> URE (given necessary axioms) can easily infer *y*=1–*x*. However, it will
> not be able to ground variables. Actually, these are not variables to be
> grounded using number nodes, but rather values should be assigned to them.
> Also, we shouldn’t just sample *x* and then calculate *y*=1–*x*, because
> different values of *y* have different prior probability. Thus, if we
> want to estimate posterior probabilities, we should add specific mechanisms
> of taking prior probabilities of inferred values into account. Of course,
> we will also need some basic random distributions.
>

Again, the URE is not very efficient.  Also, the "rules" would have to be
obtained via analytic combinatorics, and this is not easy. If it were easy,
there would not be a 600-page book on it.  It is not practical for humans
to spend time trying to hand-write rules, anyway -- the whole point is to
automate the learning of rules, from nothing, rather than to have humans
write the rules.

But the learning of rules is another, different abstract discussion.
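
As for the prior-weighting issue Alexey raises in example 2.1, one standard
way to handle it is importance sampling: sample *x* from its prior, deduce
*y* = 1 - *x*, and weight the sample by the prior density of the deduced *y*.
A sketch of just the arithmetic (not Atomese, and not how Church does it
internally):

```python
import math, random

random.seed(1)

def normal_pdf(v, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# x ~ N(0,1), y ~ N(0,1), conditioned on x + y = 1.  Deduce y = 1 - x,
# but weight each sample by the prior density of the deduced y, rather
# than treating every deduced value as equally likely.
samples, weights = [], []
for _ in range(100_000):
    x = random.gauss(0.0, 1.0)   # sampled from its prior
    y = 1.0 - x                  # deduced, not sampled
    samples.append(x)
    weights.append(normal_pdf(y))

posterior_mean = sum(x * w for x, w in zip(samples, weights)) / sum(weights)
# Analytically, the posterior over x is N(0.5, 1/2), so the mean is 0.5.
assert abs(posterior_mean - 0.5) < 0.02
```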



>
>
> 2.2) Fitting a polynomial with an unknown degree.
>
> (define (calc-poly x ws)
>
>   (if (null? ws) 0
>
>       (+ (car ws) (* x (calc-poly x (cdr ws))))))
>
> (define (calc-poly-noise x ws sigma)
>
>   (+ (calc-poly x ws) (gaussian 0 sigma)))
>
> (define (generate xs ws sigma)
>
>   (map (lambda (x) (calc-poly-noise x ws sigma)) xs))
>
> (define (sum-dif2 xs ys)
>
>   (if (null? xs) 0
>
>       (+ (+ (* (- (car xs) (car ys)) (- (car xs) (car ys))))
>
>          (sum-dif2 (cdr xs) (cdr ys)))))
>
> (define (samples xs ys)
>
>   (mh-query 10 10
>
>             (define degree (sample-integer 4))
>
>             (define ws (repeat (+ 1 degree) (lambda () (gaussian 0 3))))
>
>             (define sigma (gamma 1 2))
>
>             degree
>
>             (< (sum-dif2 (generate xs ws sigma) ys) 0.5)))
>
>
>
> (define xs '(0 1 2 3))
>
> (define ys (generate xs '(0.1 1 2) 0.01))
>
> (hist (samples xs ys) "degree")
>
>
>
> This program actually works and finds the correct degree of the polynomial
> using the Bayesian Occam razor that is available “for free” in PPLs.
>


I did not take the time to try to decipher the program above. Our
traditional answer for these kinds of problems is "use MOSES".  I did spend
some time working on a framework that would allow MOSES to solve
differential equations (that is, given a (noisy) time series of data,
automatically discover the differential equation that describes that
motion).  It seems doable and even reasonably efficient.  (Viz., Newton
differences; "umbral calculus" effectively means that polynomials and
differential equations are "the same thing".)

For some things (such as discovering differential equations in motion data)
it seems semi-reasonable to have human beings write some special-purpose,
custom software.  But in general, this runs counter to the automatic
discovery of patterns.   The above example is a large, complex program that
is hard to write and hard to understand. We cannot require humans to write
large, complex programs every time some new problem comes up.  Pattern
discovery needs to be automated.


> Here, Atomese appears to be rather useless (constraints cannot be
> deductively propagated/pattern-matched, and there is no set of atoms to
> which variables can be grounded).
>

Well, be careful in understanding what "Atomese" is. Again:

-- Atomese is a knowledge representation system. In so far as the above is
some snippet of knowledge, it can be represented in Atomese.

-- If you are thinking of the pattern matcher, and of using
forward-chaining to solve the problem of fitting a polynomial to a sequence
of data points, I think this is doable; you'd have to create a set of rules
defining least-squares fitting.  One of the example demo programs in the
examples directory shows how to create a finite probabilistic state machine
in Atomese. Another demo extends this into a mini, quasi-crippled
hidden-Markov-model-type system.  In my imagination, least-squares
polynomial fitting could be done the same way, too. It seems like a
reasonable "homework problem" for someone.

But again, having humans write least-square fitting programs in Atomese is
kind-of opposite to the general direction I want to go in.  It *is* useful
to try to solve different kinds, different classes of problems in Atomese,
such as polynomial fitting, or bayesian networks or whatever -- these are
all worthwhile experiments -- they highlight where the hard parts are,
where the easy parts are, what the shortcomings are, what is good and
what is bad, and what needs to be rethought and re-designed.

If you are thinking of the URE -- then yes, in retrospect, the concept of
Variables is clearly a mistake.  It was a reasonable mistake to make, but
still a mistake.  It is an example of an experiment in Atomese. We learned
something from it.



> Small modifications of Pattern Matcher might be enough (e.g. introducing
> RandomVariableNode and sampling its value from a given distribution instead
> of systematically enumerating all possible groundings),
>

We already have that; except the name is different. Two of them, actually:
RandomNumberLink, and RandomChoiceLink
https://wiki.opencog.org/w/RandomNumberLink
https://wiki.opencog.org/w/RandomChoiceLink

These were originally implemented for doing a probabilistic-programming
approach to controlling the Hanson Robotics Sophia robot.

We do NOT have a CountVariableNode that could accumulate counts of how
often some particular event/branch/clause was explored.  Currently we
accumulate frequentist counts in two major OpenCog subsystems: one is in
the Pattern Miner, the other is in the natural-language-learning
subsystem.   Both accumulate the counts in an ad-hoc way, outside of
atomese.

It might be (should be?!!) interesting to accumulate counts inside of
Atomese proper, in a principled way.  However, this is easier said than
done.  For example, in language learning, first we observe pairs of words,
and increment counts -- not done in Atomese, but it could be, and it's not
that hard.  But the second step is to perform an "MST parse", create
disjuncts and accumulate statistics on those. Neither the MST parse nor the
disjunct extraction is written in Atomese ... nor, at this stage, would I
want it to be.

Experimenting with count accumulation in Atomese would be an interesting
thing to play with, I suppose.  I can even offer some prototype suggestions
-- the natural place to store counts is in Values.  We would need something
like an

IncrementCountLink
     SomeAtom
     PredicateNode "some key"

which adds 1.0 to the FloatValue attached to SomeAtom at "some key".
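
The semantics of such an (as-yet hypothetical) IncrementCountLink might be
sketched like this, with a key-to-float table standing in for the Values
attached to an atom:

```python
# A sketch of the proposed (hypothetical) IncrementCountLink semantics:
# every atom carries a key -> float table of Values, and the link adds
# 1.0 to the FloatValue stored at the given key.
class Atom:
    def __init__(self, name):
        self.name = name
        self.values = {}          # PredicateNode key -> FloatValue

def increment_count(atom, key, delta=1.0):
    atom.values[key] = atom.values.get(key, 0.0) + delta
    return atom.values[key]

word_pair = Atom('(ListLink (WordNode "the") (WordNode "cat"))')
for _ in range(3):                # observe the pair three times
    increment_count(word_pair, 'some key')
assert word_pair.values['some key'] == 3.0
```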

-------------

In the end, I am not at all clear on what you and Ben have agreed on w.r.t.
probabilistic programming.   Personally, I am very interested in having a
probabilistic-sampling rule-engine. The original idea was to convert rules
to ASP, then use the Potsdam ASP solver (which is both fast and has a nice
API) to repeatedly run the rules a zillion times, each with different
probabilistic samplings, and then accumulate averages. I think that such a
probabilistic rule engine could potentially beat PLN at its own game.

I'm also interested in replacing the URE by a URE-II that avoids the use of
variables, and replaces chaining by constraint satisfaction. Exactly how to
make that probabilistic ... well, I sketched some hints, above, but it's
hard. Currently, I use the link-grammar parser for experiments in
probability, but it is a very special-purpose system; it's not general.

Separately, I am very interested in various probabilistic pattern detection
problems, but for those, well, this email is too long, and my thinking is
cloudier.

--linas


> but there still remain some problems (e.g. with marginalization, stochastic
> recursion, the inefficiency of blind guess in comparison with the
> metaheuristic search, etc.).
>
>
>
> -- Alexey
>
>
>
>


-- 
cassette tapes - analog TV - film cameras - you
