Optimal Prediction

Juergen Schmidhuber Tue, 26 Mar 2002 05:07:35 -0800

Many on this list have discussed the anthropic principle. It
essentially says that the conditional probability of finding
yourself in a universe compatible with your existence equals 1.


But unfortunately the anthropic principle does not have any
predictive power. It does NOT predict there won't be any flying
rabbits tomorrow.

Clearly we need more than the anthropic principle to explain
the predictability of our universe. In particular, we need an 
optimal way of predicting the future, given past observations. 

And there is one!

Normally we do not know the true conditional probability 
distribution p(next event | past). But assume we do know that 
p is in some set P of distributions. Choose a fixed weight w_q 
for each q in P such that the w_q add up to 1 (for simplicity, 
let P be countable). Then construct the Bayesmix 
M(x) = Sum_q w_q q(x), and predict using M instead of the
optimal but unknown p.

How wrong is it to do that? The recent exciting work of my postdoc
Marcus Hutter (IDSIA) provides general and sharp (!) loss bounds:

Let LM(n) and Lp(n) be the total expected losses of the M-predictor 
and the p-predictor, respectively, for the first n events. Then 
LM(n)-Lp(n) is at most of the order of sqrt[Lp(n)]. That is, M is 
not much worse than p. And in general, no other predictor can do 
better than that!

In particular, if p is deterministic, then the M-predictor soon 
won't make any errors any more.

If P contains ALL recursively computable distributions, then M 
becomes the celebrated enumerable universal prior. That is, after 
decades of somewhat stagnating research we now have sharp loss 
bounds for Solomonoff's universal (but incomputable) induction 
scheme. And if we also include the distributions computable in 
the limit, we get sharp loss bounds for the even more general 
priors introduced in "Algorithmic Theories of Everything": 
http://www.idsia.ch/~juergen/toesv2/

Similarly, if we replace M by the Speed Prior S - where S(x) is 
small if x is hard to compute by any method - we obtain appropriate 
loss bounds for computable S-based induction:
http://www.idsia.ch/~juergen/toesv2/node32.html

Alternatively, reduce M to what you get if you just add up 
weighted estimated future finance data probabilities generated by 
1000 commercial stock-market prediction software packages. If only 
one of them happens to work fine (but you do not know which) you 
still should get rich.

To learn more, please read

Optimality of Universal Bayesian Sequence Prediction for General Loss  
and Alphabet:        ftp://ftp.idsia.ch/pub/techrep/IDSIA-02-02.ps.gz

and also check out Hutter's other recent papers at ICML, ECML, NIPS,
Int. J. of Foundations of CS: www.idsia.ch/~marcus



Juergen Schmidhuber
http://www.idsia.ch/~juergen/

Optimal Prediction

Reply via email to