SOLOMONOFF MACHINES – UP CLOSE AND PERSONAL

This is in response to Shane Legg’s post of Fri 11/9/2007 9:40 AM and
Lukasz Stafiniak’s posts of Fri 11/9/2007 7:13 AM and Fri 11/9/2007 12:09
PM.

==Preface====================
Here is a list of questions I have from reading these very helpful posts.
I have placed most of my questions under language copied from Lukasz’s two
posts, because its question-and-answer nature provided more heading-like
quotes, and because many of the questions I had in response to Shane’s
post fit under such headings.  In the following, “ED######>” preceded by
“>>” indicates a question or statement from me that Lukasz was responding
to.  A divider such as “==Q1========================” precedes each
portion of text quoted from one of Lukasz’s or Shane’s posts, or which
starts a different area of questioning by me.  These dividers can be used
to indicate the subject matter if you choose to respond to one or a few of
them at a time.

Shane, I would be interested in reading your PhD thesis. (But I don’t know
how much of it I will be able to understand.)

Don’t feel obligated to answer all of this at once or even ever.  But I
would appreciate as many answers as possible.

==Q1========================
>Lukasz######> Kolmogorov complexity is "universal". For probabilities,
you need to specify the probability space and initial distribution over
this space.
ED######> I understand the importance of this: i.e., there are many
different types of distributions: binary, one-of-n, integer-in-range
[x,y], complex-in-range [x+ai, y+bj], vector-in-n-dimensional-space,
distributions-of-distributions, etc.

So is the real significance of the universal prior not the particular
probability value it assigns in a given probability space (which seems
relatively unimportant, provided it is not one or vanishingly close to
zero), but rather the fact that it can model almost any kind of
probability space?

How does Kolmogorov complexity help deal with this problem?  Is it
because, since its major form of representation is programs on a
Turing-equivalent machine, it has the capacity to represent virtually any
type of space that can be represented by a computer?  It seems to me that
a system that could remember significant portions of its sensory inputs
and machine states, and that could match against them (such as Novamente),
could have something approaching the same generality for a given degree of
computing power, minus the (at least to me) not totally well understood
potential magic of program-based representation.

But how does a Solomonoff machine in the real world, decide what type of
probability space should be used for a given distribution?  And how does
it decide what is to be treated as a variable?

It seems to me that before it selects a probability space for a variable
it first has to decide what it considers to be a variable (other than when
dealing with primitive sensory inputs).  That is, you can’t develop a
model for variation unless you know what it is that’s varying.  It is my
feeling that any generalization is a variable, because it can have
multiple instantiations, so it seems to me that you have to build up
generalizations before you can have variables.  For example, the set of
integers is a set defined by a generalization about countable values.  It
seems to me that the variations among instances of the generalization
define possible dimensions for the associated probability space.

I assume the Solomonoff machine defines generalizations as covering
patterns that can be produced by programs that are similar, and one such
measure of similarity is the Normalized Compression Distance that Shane
turned me on to.  Presumably some measure of importance is involved in
deciding which patterns are important, unless you are in the la-la land of
unlimited computational resources.

How are probabilities represented?  I assume that once a space has been
picked you just use normal Bayesian probabilities with the Solomonoff
prior (which, again, seems much less important than the picking of the
space, provided it is not a clearly stupid value).
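
To check my understanding of how such a prior might be assigned, here is a
toy sketch, which is purely my guess.  The run_program interpreter below
is a made-up stand-in for a universal machine (it just repeats the
program’s bits), and the enumeration is truncated, so this only shows the
shape of the weighting, not a proper semimeasure:

from itertools import product

def run_program(program, n_out):
    """Toy stand-in for a universal machine: treat the program's bits
    as a pattern and output that pattern repeated.  A real Solomonoff
    prior would run programs on a universal prefix Turing machine."""
    return "".join(program[i % len(program)] for i in range(n_out))

def universal_prior(x, max_len=12):
    """Approximate M(x): sum 2^-len(p) over every program p (up to
    max_len bits) whose output starts with the observed string x."""
    m = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if run_program(p, len(x)).startswith(x):
                m += 2.0 ** (-length)
    return m

print(universal_prior("01010101"))  # regular string: large prior weight
print(universal_prior("01101000"))  # irregular string: much smaller weight

If that is roughly the right picture, then the prior mass a string
receives is dominated by its shortest generating programs, which would
explain why the exact numerical value matters less than what kinds of
spaces and patterns the programs can express.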

==Q2========================
>> ED######>  So are the programs just used for computing Kolmogorov
complexity or are they also used for generating and matching patterns.
>Lukasz######> It is difficult to say: in AIXI, the direct operation is
governed by the expectimax algorithm, but the algorithm works "in future"
(is derived from the Solomonoff predictor). Hutter mentions alternative
model AIXI_alt, which models actions the same way as the environment...
ED######> ??Shane??, what are the major ways programs are used in a
Solomonoff machine?  Are they used for generating and matching patterns?
Are they used for generating and creating context specific instantiations
of behavioral patterns?

Is the expectimax algorithm not just a standard serial algorithm for
reinforcement learning, except that it has a recording of its entire
history and the ability to search all possible sequences of actions based,
presumably, on perfect induction of probabilities from that history?  How
would it be wedded to the Solomonoff machine?

==Q3========================
>Lukasz######> It is automatic: when you have a program with a good enough
match, then you can "parameterize" it over the difference and apply twice,
thus saving the code.
ED######>  What do you mean by “parameterize” it?  Do you mean (a) tweak
parameters in the code so that it changes the manifestation of the
program, (b) that extra or different code is added to make up all the
differences, (c) tune it to accept less than totally correct matches, or
(d) all of the above?
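
Just so I am sure I understand interpretations (a) and (b), here is a
trivial made-up example of what I take “parameterizing over the
difference” to mean: two near-duplicate patterns get replaced by one
shared program plus two short parameter settings.

# Suppose the machine had found two near-duplicate programs:
#   "draw a row of 5 circles of radius 2"
#   "draw a row of 9 circles of radius 3"
# Storing one parameterized program plus two small parameter tuples is
# shorter than storing both programs literally, because the shared code
# is paid for only once.

def draw_row(count, radius):
    return [("circle", i * 2 * radius, radius) for i in range(count)]

pattern_a = draw_row(5, 2)   # first instantiation
pattern_b = draw_row(9, 3)   # second instantiation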

==Q4========================
>Lukasz######> Remember that the programs need to represent the whole
history.
ED######>  Do you mean that the library of all the programs must be able
to represent the whole history?

Do you also mean that there is one single program that represents the
whole history?

==Q5========================
>>ED######>  Can the programs learn that similar but different patterns
are different views of the same thing? Can they learn a generalizational
and compositional hierarchy of patterns?
>Lukasz######> With an egzegetic enough interpretation...
ED######> Re: learning different views of similar things, it is generally
believed that in humans much of the learning that differing views belong
to the same thing comes from experiencing spatial and temporal continuity
across sequences of view transformations.  If your system is any good at
generalized pattern learning it should be able to learn a generality for
things that have temporal and spatial continuity, and the importance of
that generality.  It is one of the key characteristics of most physical
objects.

Re: learning generalization and compositional pattern hierarchies, I,
Novamente, and I assume a large number of other people have some pretty
good ideas about how to do that.

Re: “egzegetic”  There was only one, repeat one, hit for “egzegetic” on
Google, and I think it was a reference to another paper.  So what does it
mean -- exegetic, as in exegesis?  Do you mean “if one had enough ability
to draw meaning out of the machine’s history”?

==Q6========================
>Lukasz######> these programs will be compact description o[f] data when
enough data gets collected, so their (posterior) probability will grow
with time. But the most probable programs will be very cryptic, without
redundancy to make the structure evident.
ED######> I don’t understand how probabilities are used in a Solomonoff
machine, but I assume it is as I said under Q1 above.
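
In case it helps pin down what I am assuming in Q1: my current guess is
that each program p starts with prior weight 2^-len(p), the (deterministic)
programs whose output contradicts the history are discarded, and the
surviving weights are renormalized, so the shortest programs still
consistent with the data come to dominate the posterior.  A sketch of that
guess, with run(p, n) standing in for a universal machine, like the toy
interpreter above:

def posterior(history, programs, run):
    """Posterior over deterministic programs given the observed history.
    run(p, n) returns the first n symbols program p would print.
    Prior weight is 2^-len(p); programs inconsistent with the history
    get zero weight; the rest are renormalized."""
    weights = {p: 2.0 ** (-len(p))
               for p in programs
               if run(p, len(history)) == history}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()} if total else {}

As more history arrives, longer ad hoc programs keep getting eliminated
while the short ones that genuinely capture the regularity survive, which
I take to be what Lukasz means by the posterior probability growing with
time.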

==Q7========================
>Lukasz######> The programs do not compute K complexity, they (their
length) _are_ (a variant of) Kolmogorov complexity. The programs compute
(predict) the environment.
ED######> Yes, I follow the point that the length of the programs is what
constitutes the complexity.

Does the machine also learn by predicting and comparing predictions to
outcomes?

Why would the more frequent programs or patterns become more compactly
represented?  Do you mean a more efficient representation would be made
for a repeating pattern because (a) more time would be spent trying to
compress it, or (b) the plurality of repeated patterns corresponding to a
given generalization would be more compactly represented because there
would be a common generalization that could, with a little tweaking, be
used to represent all of them?
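
My reading of (b), with an ordinary compressor standing in for “shortest
program length” (so only a crude analogy): once a generalization exists,
each further occurrence of the pattern costs almost nothing to describe.

import os
import zlib

repeated = b"the cat sat on the mat " * 50   # one sub-pattern, many occurrences
noise = os.urandom(len(repeated))            # incompressible data of equal length

# Compressed length as a computable stand-in for program length:
print(len(repeated), len(zlib.compress(repeated, 9)))  # small: roughly one copy
                                                       # of the pattern plus
                                                       # "repeat it 50 times"
print(len(noise), len(zlib.compress(noise, 9)))        # roughly the original size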

==Q8========================
>Lukasz######> The programs are generally required to exactly match in
AIXI (but not in AIXItl I think).
ED######> ??Shane??, could you please give us an assist on this one? Is
exact matching required?  And if so, is this something that could be
loosened in a real machine?

==Q9========================
>>ED######>  ...does it know when a match is good enough that it can be
relied upon as having some significance?
>Lukasz######> ...the "significance" is provided by the compression on
representation of similar things, which favors the same sort of similarity
in the future.
ED######> Why does compression favor representation of similar things?  Is
Kolmogorov complexity used to determine probability for purposes of
inference, or is that done by determining frequencies of occurrence of
similar patterns using Bayesian probabilities?
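
My guess at why compression favors similar things, again with an
off-the-shelf compressor as a crude stand-in for Kolmogorov complexity:
describing two similar things together is much cheaper than describing
them separately, and that shared description is exactly the kind of
generalization that can then be reused on future data.  Is that the idea?

import os
import zlib

def C(b):  # compressed length as a computable stand-in for complexity
    return len(zlib.compress(b, 9))

x = b"the quick brown fox jumps over the lazy dog " * 20
y = x.replace(b"dog", b"cat")   # similar to x
z = os.urandom(len(x))          # unrelated to x

print(C(x) + C(y), C(x + y))    # big saving: y is nearly free once x is known
print(C(x) + C(z), C(x + z))    # almost no saving: z shares no structure with x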

==Q10========================
>>ED######>  Can they run on massively parallel processing.
>Lukasz######> I think they can... In AIXI, you would build a summation
tree for the posterior probability.
ED######> Are you referring to a summation tree that would sum over the
occurrences of similar patterns at different epochs in past history, to
determine how often they have occurred and how often they have occurred
within similar patterns?
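
Or, to make sure I am picturing the right thing: is the summation
something like the following, where the posterior is a big sum of
per-program terms and the sum is evaluated as a balanced tree, so that the
pairs at each level can be farmed out to separate processors and only
log2(n) sequential levels remain?  (The weights here are just made-up
2^-length terms.)

def tree_sum(values):
    """Pairwise (tree-structured) summation.  Every pair at a given level
    is independent of the others, so each level could run in parallel."""
    while len(values) > 1:
        paired = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd element is carried up unchanged
            paired.append(values[-1])
        values = paired
    return values[0] if values else 0.0

# e.g. summing the prior weights of the programs that survived matching
# against the history:
print(tree_sum([2.0 ** -l for l in (3, 5, 5, 7, 8, 8, 9)]))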

==Q11========================
>Lukasz######> To be optimal, the expectimax must be performed
chronologically from the end of the horizon (dynamic programming
principle: close to the end of the time horizon, you have smaller planning
problems -- less opportunities; from smaller solutions to smaller problems
you build bigger solutions backwards in time).
ED######> Is the horizon (a) the present, or (b) as far in the future as
the machine is going to think?  I presume it is (b).

Can the expectimax algorithm both forward and backward chain?

What control over its own behavior does the expectimax have, such as
switching between forward and backward chaining, or focusing attention, or
does it avoid all such issues by simply computing everything in all
possible ways?
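
And, to check my reading of the backward/dynamic-programming point, here
is a minimal sketch of finite-horizon expectimax as I currently picture
it.  The names env_prob and reward are placeholders of my own: env_prob
(history, a, o) would be the predictor’s conditional probability of
percept o given the history and action a, and reward(o) the reward
component of the percept.  The recursion explores histories forward, but
the values are actually assembled from the horizon backwards, which I take
to be the dynamic-programming structure you describe:

def expectimax(history, t, horizon, actions, percepts, env_prob, reward):
    """Finite-horizon expectimax: maximize over actions, take the
    expectation over percepts under the environment model, recurse to
    the horizon, and propagate values back."""
    if t == horizon:
        return 0.0
    best = float("-inf")
    for a in actions:
        value = 0.0
        for o in percepts:
            p = env_prob(history, a, o)   # model's P(o | history, a)
            if p > 0.0:
                value += p * (reward(o) +
                              expectimax(history + [(a, o)], t + 1, horizon,
                                         actions, percepts, env_prob, reward))
        best = max(best, value)
    return best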

==Q12========================
>> ED######>> are these short codes sort of like Wolfram little
codelettes, that can hopefully represent complex patterns out of very
little code, or do they pretty much represent subsets of visual patterns
as small bit maps.
>Lukasz######> It depends on reality, whether the reality supports
Wolfram's hypothesis.
ED######> By Wolfram codelettes I did not mean Game of Life codelets; I
meant compact programs that could represent a lot with a little.

In one of his published papers Ben Goertzel said that MOSES, which is an
evolutionary programming component of Novamente, is good at learning
modest-sized programs, but is not good at learning a big program from
scratch because the search space explodes as problem complexity grows.
But I have no feeling for what size problems it can attack, at what speed,
with what hardware.

But MOSES is capable of compositional learning over time, so that after it
learns a lot of little codelettes, it could then use them as pieces in
bigger codelettes.

I have no idea how efficient it is to find computer programs that can
represent patterns much more efficiently than a literal recording (other
than by using standard data compression techniques, or by building up a
gen/comp hierarchy of patterns rather literally derived from the patterns
below them, as the brain does).  This is what Jeff Hawkins calls
hierarchical memory, a good example of which is described, with regard to
the visual “what” system, in the Serre paper I have cited so many times
before (
http://cbcl.mit.edu/projects/cbcl/publications/ps/MIT-CSAIL-TR-2006-028.pdf
).

Isn’t there a large similarity between a Solomonoff machine that could
learn a hierarchy of pattern-representing programs and Jeff Hawkins’s
hierarchical learning (as represented in the Serre paper)?  One could
consider the patterns at each level of the hierarchy as subroutines.
The system is designed to increase its representational efficiency by
having representational subroutines available for use by multiple
different patterns at higher compositional levels.  To the extent that a
MOSES-type evolutionary system could be set to work making such
representations more compact, it would become clearer how semi-Solomonoff
machines could be made to work in the practical world.
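
As a toy illustration of what I mean by treating each level of the
hierarchy as subroutines (the feature names are entirely made up): the
low-level codelettes are written once, and every higher-level pattern that
reuses them only pays for the few symbols needed to wire existing pieces
together, so the total description keeps shrinking as the hierarchy grows.

def edge(orientation):                # low-level codelette, written once
    return ("edge", orientation)

def corner(a, b):                     # mid-level: a composition of edges
    return ("corner", edge(a), edge(b))

def square():                         # high-level: a composition of corners
    return ("square", corner(0, 90), corner(90, 180),
            corner(180, 270), corner(270, 0))

def triangle():                       # reuses the same corner codelette
    return ("triangle", corner(0, 60), corner(60, 120), corner(120, 0))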

==Q13========================
ED######> How is pattern matching done in a Solomonoff machine?  Are all
the pattern-representing programs run against the current or past input to
be analyzed?  Is parallelism used, and if so, could it not be used in a
manner similar to the pattern matching in the Serre paper?

==Q14========================
ED######> How does a Solomonoff machine take into account the difference
between the complexity of occurrence and the complexity of observation?

I assume it could have different pattern-modeling components that model
these different types of complexity, allowing you to factor out the
complexities of observation when trying to calculate the complexities of
occurrence.  Shane said a machine trying to interpret a stream of video
would have to be trained up on video input to be able to find structure in
the world observed by such video.

==Q15========================
ED######> Is a Solomonoff machine simply a machine that uses Solomonoff
induction?  Or, as the above discussion suggests, is it much more?  Is
there any commonly understood definition of a Solomonoff machine and, if
so, what is it?

The definitions of Solomonoff induction on the web, and even in Shane
Legg’s paper “Solomonoff Induction”, make it sound like it is merely
Bayesian induction with priors picked on the basis of Kolmogorov
complexity.  But statements made by Shane and Lukasz appear to imply that
a Solomonoff machine uses programs and program size as a tool for pattern
representation, generalization, learning, inference, and more.

==Q16========================
>Shane######>...our prior knowledge of the world strongly biases how we
interpret new information.  So, to use your example, we all know that
people living on a small isolated island are probably genetically quite
similar to each other.  Thus, if we see that one has brown skin, we will
guess that the others probably do also.  However, weight is not so closely
tied to genetics, and so if one is obese then this does not tell us much
about how much other islanders weigh.
Out-of-the-box a Solomonoff machine doesn't know anything about genetics
and weight, so it can't make such inferences based on seeing just one
islander.  However, if it did have prior experience with genetics etc.,
then it too would generalise as you describe using context.  Perhaps the
best place to understand the theory of this is section 8.3 from "An
Introduction to Kolmogorov complexity and its Applications" by Li and
Vitanyi.  You can also find some approximations to this theory that have
been applied in practice to many diverse problems under the title of
"Normalized Compression Distance" or NCD.  A lot of this work has been
done by Rudi Cilibrasi.
ED######> I found what appeared to be a good definition of NCD at
http://www.complearn.org/ncd.html, i.e., NCD(x,y) =
( C(xy) - min{C(x),C(y)} ) / max{C(x),C(y)}, where C(x) is a complexity
measure for which the Kolmogorov complexity can be used.
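
To make sure I have it right, here is how I would compute that in
practice, with zlib standing in for the uncomputable C (which, as I
understand it, is roughly what the applied NCD work does with real
compressors):

import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    with compressed length under zlib standing in for C."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"the cat sat on the mat " * 30,
          b"the cat sat on the hat " * 30))   # small: near-duplicates
print(ncd(b"the cat sat on the mat " * 30,
          bytes(range(256)) * 3))             # much larger: little shared structure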

So I think (though I could well be wrong) that I know what the formula
means.  Unfortunately I am a little fuzzy about whether NCD would take
“what” information, “what-with-what” (binding) information, or frequency
information sufficiently into account to be an optimal measure of
similarity.  Is this correct?

If so, it seems to me that one might have a better measure of similarity
if it actually included such additional types of information.  If any of
these types of information are included in NCD, please explain how.

It appears (at least to me) that a Solomonoff machine is going to have to
develop similarity measures based on more than complexity alone to do any
sort of sophisticated pattern matching for tasks involving learning from
video.  So it seems to me that similarity measures capable of using such
more sophisticated similarity information could be developed by a
Solomonoff machine.  Is that correct, or totally off base?

Thank you for whatever enlightenment you can give me and other readers on
this list by answering these questions.

Ed Porter
(617) 494-1722
Fax (617) 494-1822
[EMAIL PROTECTED]
