On Mon, Apr 28, 2014 at 9:48 AM, Alex Miller <[email protected]> wrote:

> (Please let me know if this isn't the best forum for this kind of question)
>

If you have technical questions about AIXI, you may get better answers on
the dedicated mailing list:
https://groups.google.com/forum/?fromgroups#!forum/magic-list
which covers mathematical AGI and AIXI and Solomonoff Induction in
particular.


> Suppose we give a long sample from the Bernoulli distribution to AIXI.
> I would expect an intelligent agent to figure out that it cannot see
> any patterns there and be able to continue the sequence accordingly.
>
> However, I think the shortest programs that reproduce such a sequence
> (the ones that should dominate the hypothesis space, according to
> AIXI) will be the ones that copy the given sequence and then produce
> something highly regular, e.g. all zeros. Is this correct?
>

Let's consider Solomonoff Induction (SI), which is at the heart of AIXI and
is designed purely for prediction (AIXI extends it to choosing actions
optimally).

SI does not consider only the single simplest deterministic program; if it
did, it would behave the way you predict, which is not good enough.
What you describe is closer to the Minimum Description Length (MDL)
principle, which is a restriction of SI and is mainstream in machine
learning.

Instead, SI considers *all* computable programs that are consistent with
the observed sequence.
Since a coin-flip sequence has no particular (deterministic) structure,
some of the consistent programs will predict that the next bit is 0, and
some will predict that it is 1.
Now, since SI takes a mixture (weighted sum) of the predictions of all
these consistent programs, it will in general predict 0 with probability
roughly 1/2. It is not exactly 0.5, though, because there are fluctuations
in the set of consistent programs: sometimes it will be slightly more than
0.5, sometimes slightly less. (In particular, if there have been more 0s
than 1s in the past, 0 will in general have a slightly higher probability.)
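To make the mixture idea concrete, here is a toy sketch in Python. It uses
a tiny, hand-picked list of "programs" with made-up bit lengths (real SI
enumerates *all* programs on a universal machine, which is incomputable);
the point is only the mechanics: drop inconsistent programs, weight the
rest by 2^-length, and take the weighted vote on the next bit.

```python
# Toy sketch of an SI-style mixture (hypothetical program list and
# lengths; real SI sums over ALL programs on a universal machine).

def const(b):              # "print b forever"
    return lambda i: b

def copy_then(prefix, b):  # "print `prefix`, then b forever"
    return lambda i: prefix[i] if i < len(prefix) else b

x = "010010101011"  # the observed sequence from the example

# (length_in_bits, output function) pairs; lengths are illustrative
programs = [
    (1, const("0")),                  # inconsistent with x
    (1, const("1")),                  # inconsistent with x
    (len(x) + 2, copy_then(x, "0")),  # "print x, then 0 forever"
    (len(x) + 2, copy_then(x, "1")),  # "print x, then 1 forever"
]

def consistent(gen, obs):
    return all(gen(i) == c for i, c in enumerate(obs))

w0 = w1 = 0.0
for length, gen in programs:
    if not consistent(gen, x):
        continue                  # dropped from the mixture
    w = 2.0 ** (-length)          # prior weight 2^-length(p)
    if gen(len(x)) == "0":
        w0 += w
    else:
        w1 += w

print(w0 / (w0 + w1))  # P(next bit = 0) under this toy mixture → 0.5
```

In this tiny mixture only the two "copy x" programs survive, and since
they have equal length they split the prediction exactly 50/50; a real
mixture contains vastly more programs, which is where the small
fluctuations around 0.5 come from.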

A bit more detail:

Say we are at time T and the observed sequence generated so far by the
coin-flip process is x=010010101011.
For a program p to be consistent with x, the execution of p must produce
the sequence x, possibly followed by more bits.
Also, the weight (in the mixture) of a program p is 2^(-length(p)), so the
weight is lower when the length of p (in bits) is larger.

Now, consider the program p0 that always outputs 0. This program is not
consistent with the x above, so it is not included in the mixture.
However, we can write a specific program p0' that outputs exactly x, and
then always outputs 0.
Since there is no structure in x (we say that x is random), the best we
can do to output x is essentially the program "print 010010101011".
So the length of p0' is roughly the length of x (which is T) plus the
length of "then print 0 forever".
p0' is consistent with x, but its weight is small compared to that of p0:
roughly 2^-T times smaller.
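The weight comparison is just arithmetic; here is a quick sketch, where c
stands for the (hypothetical) overhead in bits of the "print ... forever"
looping code:

```python
# Weight of p0 ("print 0 forever") vs p0' ("print x, then 0 forever").
# c is a hypothetical constant overhead in bits; T is the length of x.
T = 12                      # length of x = "010010101011"
c = 8                       # assumed overhead of the looping code
w_p0  = 2.0 ** (-c)         # weight of p0
w_p0p = 2.0 ** (-(T + c))   # weight of p0'

print(w_p0p / w_p0)         # == 2**-12, i.e. 2^-T times smaller
```

Whatever value c takes, the overhead cancels in the ratio, which is why
the 2^-T gap between p0 and p0' does not depend on the details of the
programming language.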

Now, if the next bit we receive from the coin-flip generator at time T+1
is 1, p0' becomes inconsistent and is removed from the mixture.
But again, there is a program p0'' that is like p0' except that it outputs
1 instead of 0 at time T+1. The weight of p0'' is smaller than that of p0'
(roughly 2^-1 = 1/2 of it), because it looks like "print
010010101011*1* then print 0 forever".
You can easily imagine programs p1, p1', p1'', etc. that output 1 instead
of 0. Their weights are very close (if not equal) to those of p0, p0' and
p0'', and this is partly why 0 and 1 each get probability roughly 1/2.
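The update step can be sketched the same way, again with a hand-picked
pair of "programs" and hypothetical bit lengths: once the new bit arrives,
p0' fails the consistency check and drops out, while p0'' (one bit longer,
so half the weight) survives.

```python
# Toy update step: when bit T+1 arrives, inconsistent programs are
# dropped from the mixture. Program lengths here are illustrative.

x, new_bit = "010010101011", "1"

def prints(prefix, tail):
    # "print `prefix`, then `tail` forever"
    return lambda i: prefix[i] if i < len(prefix) else tail

programs = {
    "p0'":  (len(x) + 2, prints(x, "0")),        # print x, then 0s
    "p0''": (len(x) + 3, prints(x + "1", "0")),  # print x,1, then 0s
}

obs = x + new_bit
for name, (length, gen) in programs.items():
    ok = all(gen(i) == c for i, c in enumerate(obs))
    print(name, "weight", 2.0 ** (-length), "consistent:", ok)
```

Running this shows p0' marked inconsistent (its 13th output is 0, not 1)
while p0'' remains, carrying half of p0's former weight.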

There is actually an /equivalent/ definition of SI in terms of *stochastic*
programs (instead of deterministic ones); this class contains the
Bernoulli distribution (and in fact all computable distributions).
One can show that SI converges to optimal prediction after roughly K(q)
prediction errors, where q is the true generating distribution and K(q) is
the size of the smallest program equivalent to q (which is very small if
the sequence to predict is not random!), even when the generator is
stochastic. It is not easy to grok why the deterministic and stochastic
definitions are equivalent (the proof is hard to read), but hopefully the
example above gives you some insight.
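As a rough illustration of the stochastic view, here is a small Bayesian
mixture over a grid of Bernoulli(p) hypotheses. The uniform prior over the
grid is an assumption for simplicity (SI proper would weight each
stochastic program by 2^-length); the point is that the posterior
predictive quickly tracks the empirical frequency, so for a fair coin it
ends up near 1/2.

```python
# Bayesian mixture over Bernoulli(p) hypotheses on a grid (a simplified
# stand-in for the stochastic-program mixture; uniform prior assumed).
import math
import random

random.seed(0)
x = [random.randint(0, 1) for _ in range(1000)]  # fair-coin data

ps = [i / 20 for i in range(1, 20)]   # p = 0.05, 0.10, ..., 0.95
log_w = {p: 0.0 for p in ps}          # log-likelihood of each hypothesis
for bit in x:
    for p in ps:
        log_w[p] += math.log(p if bit else 1 - p)

# Posterior predictive P(next bit = 1): likelihood-weighted average of p
m = max(log_w.values())               # subtract max for numerical safety
ws = {p: math.exp(log_w[p] - m) for p in ps}
z = sum(ws.values())
pred = sum(w * p for p, w in ws.items()) / z

print(pred)   # typically close to 0.5 for fair-coin data
```

Unlike the deterministic mixture, no hypothesis here is ever "removed";
inconsistent-looking hypotheses just see their likelihood (and hence their
posterior weight) shrink exponentially.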

If you want more details on Solomonoff Induction and AIXI, I recommend
Shane Legg's thesis, which is a nice introduction:
http://www.vetta.org/documents/Machine_Super_Intelligence.pdf

Hope this helps,
Laurent


