Re: Wolfram 2,3 Turing Machine
Impressive result by Alex Smith! Funny though how Wolfram's web sites on this print Wolfram's name in larger font and more frequently than Smith's, even trying to sell this as New Kind Of Science although it's just a continuation of a decades-old search for small universal Turing machines :-) BTW, check out Marcus Hutter's older posting to the Kolmogorov Complexity mailing list on whether such machines should really count as UTMs or not: http://mailman.ti-edu.ch/pipermail/kolmogorov/2007/000245.html JS http://www.idsia.ch/~juergen/computeruniverse.html http://www.idsia.ch/~juergen/wolfram.html On Oct 24, 2007, at 8:32 PM, Tom Caylor wrote: We're excited to announce that the $25,000 Wolfram 2,3 Turing Machine Research Prize has been won. Alex Smith, a 20-year-old undergraduate in Birmingham, UK, has given a 40-page proof that Wolfram's 2,3 Turing machine is indeed universal. This result ends a half-century quest to find the simplest possible universal Turing machine. It also provides strong further evidence for Wolfram's Principle of Computational Equivalence. The official prize ceremony is planned for November at Bletchley Park, UK, site of Alan Turing's wartime work. For more information about the prize and the solution, see: http://www.wolframprize.org Stephen Wolfram has posted his personal reaction to the prize at: http://blog.wolfram.com/2007/10/the_prize_is_won_the_simplest.html
Re: measure problem
Hi Max, in this particular universe it's going well, thank you! As promised, I had a look at your paper. I think it is well written and fun to read. I've got a few comments though, mostly on the nature of math vs computation, and why Goedel is sexy but not an issue when it comes to identifying possible mathematical structures / universes / formally describable things. I think some of the comments are serious enough to affect the conclusions. Some come with quotes from papers in http://www.idsia.ch/~juergen/computeruniverse.html where several of your main issues are addressed. Some are marked Serious. I am making a cc to the everythingers, although it seems they are mostly interested in other things now - probably nobody is really going to read this tedious response which became much longer than I anticipated.

1. An abstract baggage-free mathematical structure does not exist any more than a baggage-free computer - the particular axiomatic system you choose is like the set of primitive instructions of the computer you choose. Not very serious, since for general computers and general axiomatic systems there are invariance theorems: changing the baggage often does not change a lot, so to speak. But it should be mentioned.

2. p 11: you say that data sampled from Gaussian random variables is incompressible - NOT true - give short codes to probable events (close to the mean), long codes to rare events (Huffman coding).

3. same sentence: how to test what inflation predicts? How to test whether the big bang seed was really random, not pseudo-random? The second million bits of pi look random but are not. We should search for short programs compressing the apparent randomness: http://www.idsia.ch/~juergen/randomness.html

4. p 15: "Mathematical structure (MS) just exists." Is that so? Others will look at your symbols and say they are just heaps of chalk on a blackboard, and you need a complex, wet pattern recognition system to interpret them. Here's where beliefs enter...

5. p 18: "mathematical structures, formal systems and computations are aspects of one underlying transcendent structure whose nature we don't fully understand" - But we do! I'd say there are NO serious open problems with your figure 5 - formal systems vs math vs computation is a well-explored field. More about this below. The 2000 paper (your nr 17) exploits this understanding; it turns out the most convenient way to deal with the measure problem is the computer science way (right hand corner of your figure 5). As I wrote in the 2000 paper: http://arxiv.org/abs/quant-ph/0011122 The algorithmic approach, however, offers several conceptual advantages: (1) It provides the appropriate framework for issues of information-theoretic complexity traditionally ignored in pure mathematics, and imposes natural complexity-based orderings on the possible universes and subsets thereof. (2) It taps into a rich source of theoretical insights on computable probability distributions relevant for establishing priors on possible universes. Such priors are needed for making probabilistic predictions concerning our own particular universe. Although Tegmark suggests that ``... all mathematical structures are a priori given equal statistical weight'' (Tegmark 1998, p. 27), there is no way of assigning equal nonvanishing probability to all (infinitely many) mathematical structures. Hence we really need something like the complexity-based weightings discussed in (Schmidhuber 1997) and especially the paper at hand.
(3) The algorithmic approach is the obvious framework for questions of temporal complexity such as those discussed in this paper, e.g., ``what is the most efficient way of simulating all universes?'' 6. Serious: run the sim, or just describe its program? Are you sure you know what you want to say here? What's the precise difference between program bitstrings and output bitstrings? The bitstrings generated by the programs (the descriptions) are just alternative descriptions of the universes, possibly less compact ones. You as an external observer may need yet another program that translates the output bits (typically a less compressed description) into video or something, to obtain the description your eyes want. Note that the 2000 paper and the 2002 journal variant don't really care for time evolution, just for descriptions - within the bitstrings maybe there is an observer who thinks he knows what's time, but to the outsider his concept of time may be irrelevant. (Unlike the 1997 paper, the 2000/2002 papers do not focus on a one to one mapping between physical and computational time steps, otherwise we'd miss all the universes where the concept of time is irrelevant.) Here's what I wrote at the end: After all, algorithmic theories of the describable do encompass everything we will ever be able to talk and write about. Other things are simply beyond description. 7. Serious: p 18 CUH: what's your def of computable? You mean by
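A minimal sketch of comment 2 above, using only the Python standard library; the bin boundaries, sample size, and quantization are arbitrary illustration choices, not anything from the paper under discussion. It quantizes Gaussian samples and builds a Huffman code, so that frequent symbols near the mean get short codewords and rare tail symbols get long ones; the average code length then falls well below the 3 bits/symbol of a fixed-length code.

import heapq
import math
import random
from collections import Counter

def huffman_lengths(freqs):
    # Standard Huffman construction; returns symbol -> codeword length.
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100000)]
bin_edges = [-3, -2, -1, 0, 1, 2, 3]                      # 8 quantization bins
symbols = [sum(x > b for b in bin_edges) for x in samples]
freqs = Counter(symbols)
lengths = huffman_lengths(freqs)

avg = sum(freqs[s] * lengths[s] for s in freqs) / len(symbols)
ent = -sum((freqs[s] / len(symbols)) * math.log2(freqs[s] / len(symbols)) for s in freqs)
print("fixed-length code: 3.00 bits/symbol")
print("Huffman code:      %.2f bits/symbol" % avg)
print("entropy bound:     %.2f bits/symbol" % ent)

This only shows that the quantized data is statistically compressible; the deeper point in comment 3 about pseudo-random seeds concerns algorithmic compressibility, which no finite statistic like this can settle.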
Zuse Symposium: Is the universe a computer? Berlin Nov 6-7
Dear colleagues, many interesting talks at the Zuse Symposium: Is the universe a computer? Berlin Nov 6-7, 2006 http://www.dtmb.de/Aktuelles/Aktionen/Informatikjahr-Zuse/ Best regards, -JS http://www.idsia.ch/~juergen/computeruniverse.html
probabilities measures computable universes
I browsed through recent postings and hope this delayed but self-contained message can clarify a few things about probabilities and measures and predictability etc. What is the probability of an integer being, say, a square? This question does not make sense without a prior probability distribution on the integers. This prior cannot be uniform. Try to find one! Under _any_ distribution some integers must be more likely than others. Which prior is good? Is there a `best' or `universal' prior? Yes, there is. It assigns to each integer n as much probability as any other computable prior, save for a constant factor that does not depend on n. (A computable prior can be encoded as a program that takes n as input and outputs n's probability, e.g., a program that implements Bernoulli's formula, etc.) Given a set of priors, a universal prior is essentially a weighted sum of all priors in the set. For example, Solomonoff's famous weighted sum of all enumerable priors will assign at least as much probability to any square integer as any other computable prior, save for a constant machine-dependent factor that becomes less and less relevant as the integers get larger and larger. Now let us talk about all computable universe histories. Some are finite, some infinite. Each has at least one program that computes it. Again there is _no_ way of assigning equal probability to all of them! Many are tempted to assume a uniform distribution without thinking much about it, but there is _no_ such thing as a uniform distribution on all computable universes, or on all axiomatic mathematical structures, or on all logically possible worlds, etc! (Side note: There only is a uniform _measure_ on the finitely many possible history _beginnings_ of a given size, each standing for an uncountable _set_ of possible futures. Probabilities refer to single objects, measures to sets.) It turns out that we can easily build universal priors using Levin's important concept of self- delimiting programs. Such programs may occasionally execute the instruction request new input bit; the bit is chosen randomly and will remain fixed thereafter. Then the probability of some universe history is the probability of guessing a program for it. This probability is `universal' as it does not depend much on the computer (whose negligible influence can be buried in a constant universe-independent factor). Some programs halt or go in an infinite loop without ever requesting additional input bits. Universes with at least one such short self-delimiting program are more probable than others. To make predictions about some universe, say, ours, we need a prior as well. For instance, most people would predict that next Tuesday it won't rain on the moon, although there are computable universes where it does. The anthropic principle is an _insufficient_ prior that does not explain the absence of rain on the moon - it does assign cumulative probability 1.0 to the set of all universes where we exist, and 0.0 to all the other universes, but humans could still exist if it did rain on the moon occasionally. Still, many tend to consider the probability of such universes as small, which actually says something about their prior. We do not know yet the true prior from which our universe is sampled - Schroedinger's wave function may be an approximation thereof. 
But it turns out that if the true prior is computable at all, then we can in principle already predict near-optimally, using the universal prior instead: http://www.idsia.ch/~juergen/unilearn.html Many really smart physicists do not know this yet. Technical issues and limits of computable universes are discussed in papers available at: http://www.idsia.ch/~juergen/computeruniverse.html Even stronger predictions using a prior based on the fastest programs (not the shortest): http://www.idsia.ch/~juergen/speedprior.html -Juergen Schmidhuber
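A toy sketch of the "weighted sum of priors" idea above, using three arbitrary computable priors on the positive integers (these particular priors, the equal weights, and the truncation of the sum are illustration choices only, not the Solomonoff mixture itself): the mixture assigns every integer at least a constant fraction of what each component prior assigns, and it gives a definite answer to the opening question about squares.

import math

def geometric(p):
    # P(n) = (1-p)^(n-1) * p  for n = 1, 2, 3, ...
    return lambda n: (1 - p) ** (n - 1) * p

def inverse_square_like(n):
    # P(n) = 1 / (n (n+1)), which also sums to 1 over n >= 1
    return 1.0 / (n * (n + 1))

priors = {"geom(1/2)": geometric(0.5), "geom(1/10)": geometric(0.1), "1/(n(n+1))": inverse_square_like}
weights = {name: 1.0 / len(priors) for name in priors}        # the w_q, summing to 1

def mixture(n):
    return sum(weights[name] * priors[name](n) for name in priors)

# Dominance up to a constant factor: mixture(n) >= w_q * q(n) for every q in the set and every n.
for name in priors:
    assert all(mixture(n) >= weights[name] * priors[name](n) for n in range(1, 500))

# "What is the probability of an integer being a square?" - under this particular mixture:
p_square = sum(mixture(n) for n in range(1, 100001) if math.isqrt(n) ** 2 == n)
print("P(n is a square) under the mixture: about %.4f" % p_square)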
everything talk - new AI
I put everything talk slides on the web: http://www.idsia.ch/~juergen/everythingtalk/ http://www.idsia.ch/~juergen/computeruniverse.html And also a recent overview article: The New AI: General & Sound & Relevant for Physics. J. Schmidhuber. TR IDSIA-04-03, arxiv:cs.AI/0302012. To appear in B. Goertzel and C. Pennachin, eds.: Artificial General Intelligence. http://www.idsia.ch/~juergen/newai/newai.html Abstract. Most traditional artificial intelligence (AI) systems of the past 50 years are either very limited, or based on heuristics, or both. The new millennium, however, has brought substantial progress in the field of theoretically optimal and practically feasible algorithms for prediction, search, inductive inference based on Occam's razor, problem solving, decision making, and reinforcement learning in environments of a very general type. Since inductive inference is at the heart of all inductive sciences, some of the results are relevant not only for AI and computer science but also for physics, provoking nontraditional predictions based on Zuse's thesis of the computer-generated universe. Comments welcome! Juergen Schmidhuber http://www.idsia.ch/~juergen/
Re: Zuse's thesis web site
Thanks for your comments, Hal! Your point on the dovetailer is well taken. I modified the text: If Zuse's thesis is correct, then which is our universe's program? It may come as a surprise to some that we already know at least one algorithm that does compute our universe (and several others)... Ed Clark has a nice review page on Wolfram's book: http://www.math.usf.edu/~eclark/ANKOS_reviews.html It includes Scott Aaronson's interesting review which also addresses the issue of Bell's inequality. Best, Juergen http://www.idsia.ch/~juergen/digitalphysics.html Juergen Schmidhuber writes: I welcome feedback on a little web page on Zuse's 1967 thesis (which states that the universe is being computed on a cellular automaton): http://www.idsia.ch/~juergen/digitalphysics.html That's very interesting; I was not aware of Zuse. Unfortunately I don't know German so I can't read his paper. Regarding the question of the compatibility of CA models with relativity and QM, Wolfram looks into this in some detail. He essentially abandons a simple CA model in favor of a more complex network of interacting nodes, which has some features similar to the Lorentz transformation of relativity. Then to address the EPR style long-distance correlations of QM, he proposes that while the network is mostly local, it has occasional nodes which get stretched apart and are connected to distant nodes. These are rare but allow for the type of information flow necessary to reproduce long-distance QM correlations. All in all it is a pretty ad hoc and unconvincing model. I tried to read the 't Hooft paper referenced here but it was over my head. It also struck me though as not really addressing the discrepancy between long-distance correlations and local CA models. It seems very much an open and difficult question to me to show how a local CA model can reproduce relativity and QM. One issue which CA models tend to ignore is the MWI. Most CA models are built as hidden variable theories which define a single universe. Some multiverse models have that structure as well. But it seems to me that this is an entirely unnecessary restriction. If a CA can model a universe, it can model a multiverse, and likewise with any other computing model like TMs. The MWI is fully deterministic, which may make it a more attractive target for modelling with a deterministic computational theory than attempting to reproduce the statistical phenomena of QM, essentially via hidden variables. Any hidden variable theory, CA based or not, has two strikes against it from the beginning due to the many well-known difficulties of Bell inequalities and EPR correlations. Regarding entropy, it is pointed out that entropy does not grow in a CA model. Wolfram discusses this as well. While entropy technically does not grow, you can get phenomena that look very much like entropy growth in a CA model. Eventually you will get a Poincare recurrence if the universe is finite. But if you start in a sufficiently simple state, there are many CA models which will mimic entropy growth into a more complex state. And this may be close enough to explain our universe. Alternatively, of course the MWI as a deterministic theory also does not have entropy growth. As mentioned above, computational models of our universe might well do better to aim towards an MWI world. As far as the claim that we already know the algorithm that runs our universe, and it is the UD: I think this is amusing but ultimately misleading.
It's true that a dovetailer which runs all programs will indeed run our own universe's program (assuming it has one), but I think it is a misuse of terminology to say that the UD is the algorithm that is running our universe. I would reserve that phrase to refer to the specific program that generates our universe and no others. It will be a tremendous accomplishment of physics and philosophy when that program is discovered, but it is misleading to give the impression that we already know what it is. I think a better terminology here would be something like, we don't need to know the specific program that describes our universe in order to imagine how to program a computer that would in fact generate our experiences, at least in theory. And then go on and explain about running all programs at once, etc. Hal Finney
Zuse's thesis web site
I welcome feedback on a little web page on Zuse's 1967 thesis (which states that the universe is being computed on a cellular automaton): http://www.idsia.ch/~juergen/digitalphysics.html Juergen Schmidhuber
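For readers who have not seen a cellular automaton at work, here is a minimal one-dimensional example in Python; the rule number, lattice size, and initial seed are arbitrary toy choices and have nothing to do with the specific automaton Zuse had in mind. Each cell holds a bit, and all cells are updated in parallel from their own state and that of their two neighbors - the kind of local, parallel update rule the thesis refers to.

def step(cells, rule=110):
    # One synchronous update of an elementary CA with periodic boundary conditions.
    n = len(cells)
    return [(rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]

cells = [0] * 31
cells[15] = 1                        # single seed cell
for _ in range(16):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)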
Re: wheeler walked away; Zuse ...
[EMAIL PROTECTED] wrote: as for your point in your post about wheeler attaching his name to the theory, I think it's ok for proponents and not originators of a theory to be named along with it. for example lately I've been referring to the fredkin-wolfram thesis. fredkin is far more the originator; wolfram is far more the proponent. seems to me the everett-wheeler theory can be fairly seen in the same way. just to set the record straight: the originator was Zuse. Here is a link to a PDF scan of his 1967 paper (on the universe running on a cellular automaton): http://www.idsia.ch/~juergen/zuse.html
colt
Physics is an inductive science. You try to find a law that matches the data, and hope that it generalizes on unseen data. When asked about the nature of their inductive business, most physicists refer to Popper's ideas from the 1930s (falsifiability etc), and sometimes to Occam's razor: simple explanations are better explanations. Most physicists, however, are not even aware of the fact that there is a formal basis for their inductive science, provided by the field of computational learning theory (COLT), in particular, the theory of universal induction. The contents of the following COLT paper may be old news to some on this list. -- The Speed Prior: a new simplicity measure yielding near-optimal computable predictions (Juergen Schmidhuber, IDSIA) In J. Kivinen and R. H. Sloan, eds, Proc. 15th Annual Conf. on Computational Learning Theory (COLT), 216-228, Springer, 2002; based on section 6 of http://arXiv.org/abs/quant-ph/0011122 (2000) http://www.idsia.ch/~juergen/speedprior.html ftp://ftp.idsia.ch/pub/juergen/colt.ps Solomonoff's optimal but noncomputable method for inductive inference assumes that observation sequences x are drawn from a recursive prior distribution mu(x). Instead of using the unknown mu(x) he predicts using the celebrated universal enumerable prior M(x) which for all x exceeds any recursive mu(x), save for a constant factor independent of x. The simplicity measure M(x) naturally implements Occam's razor and is closely related to the Kolmogorov complexity of x. However, M assigns high probability to certain data x that are extremely hard to compute. This does not match our intuitive notion of simplicity. Here we suggest a more plausible measure derived from the fastest way of computing data. In absence of contrarian evidence, we assume that the physical world is generated by a computational process, and that any possibly infinite sequence of observations is therefore computable in the limit (this assumption is more radical and stronger than Solomonoff's). Then we replace M by the novel Speed Prior S, under which the cumulative a priori probability of all data whose computation through an optimal algorithm requires more than O(n) resources is 1/n. We show that the Speed Prior allows for deriving a computable strategy for optimal prediction of future y, given past x. Then we consider the case that the data actually stem from a nonoptimal, unknown computational process, and use Hutter's recent results to derive excellent expected loss bounds for S-based inductive inference. Assuming our own universe is sampled from S, we predict: it won't get many times older than it is now; large scale quantum computation won't work well; beta decay is not random but due to some fast pseudo-random generator which we should try to discover. Juergen Schmidhuber http://www.idsia.ch/~juergen
the short program
Tim May wrote: One thing that Tegmark got right, I think, is the notion that a lot of branches of mathematics and a lot of mathematical structures probably go into making up the nature of reality. This is at first glance dramatically opposite to the ideas of Fredkin, Toffoli, Wheeler (1970), and Wolfram on the generation of reality from simple, local rules. Wolfram has made a claim in interviews, and perhaps somewhere in his new book, that he thinks the Universe may be generated by a 6-line Mathematica program! I'd like to point out that it was Konrad Zuse himself (inventor of the computer: http://www.idsia.ch/~juergen/zuse.html ) who was the first to propose that the physical universe is running on a grid of simple computers, each communicating with its neighbors: a cellular automaton. He called this Rechnender Raum, which means Computing Space. As always, Zuse was way ahead of his time. Perhaps the best reference is: Zuse, Konrad: Rechnender Raum, Schriften zur Datenverarbeitung, Band 1. Friedrich Vieweg & Sohn, Braunschweig (1969). And here is the simple method for computing all universes, including ours, if it is computable indeed (adapted from page 1 of the 97 paper http://www.idsia.ch/~juergen/everything/html.html ): Order all programs lexicographically. The first is run for one instruction every second step, the next for one instruction every second of the remaining steps, and so on. As a by-product, this program also outputs descriptions of all formally describable mathematical systems, and all proofs of all their theorems. A bit of thought shows that the method even has the optimal order of complexity. For example, it outputs our universe history as quickly as the history's fastest program, save for a (possibly huge) constant that does not depend on output size. Juergen Schmidhuber http://www.idsia.ch/~juergen/
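A small sketch of exactly the schedule quoted above, with trivial stand-in "programs" (a real dovetailer would step actual Turing machine programs; the toy generators below are only placeholders): under this schedule, program k receives every global step whose index has exactly k-1 trailing zero bits in binary, i.e., a 2^-k fraction of all steps. That is where the optimal order of complexity comes from - each program runs only a constant factor 2^k slower than it would on its own.

def scheduled_program(t):
    # Which program (1-based) gets global step t: program 1 gets every second step,
    # program 2 every second remaining step, and so on.
    k = 1
    while t % 2 == 0:
        t //= 2
        k += 1
    return k

def toy_program(k):
    # Placeholder for "execute one more instruction of program k".
    n = 0
    while True:
        yield "program %d, instruction %d" % (k, n)
        n += 1

running = {}
for t in range(1, 33):
    k = scheduled_program(t)
    if k not in running:
        running[k] = toy_program(k)
    print("step %2d -> %s" % (t, next(running[k])))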
Re: Optimal Prediction
Wei Dai wrote: BTW, isn't the justification for universal prediction taken in this paper kind of opposite to the one you took? The abstract says The problem, however, is that in many cases one does not even have a reasonable guess of the true distribution. In order to overcome this problem ... Your papers on the other hand assume that the true distribution can be known and proposed that it must be the Speed Prior. (Later you said you believe it is the Speed Prior with probability 1.) Is this still your position? Well, I prefer to recall my more cautious fallback position - let us just write down the assumptions, and derive the consequences. One does not really have to know the true prior; one just needs an upper bound on its power. Choose some prior P, and assume the universe is sampled from a prior less dominant than P. Then predictions according to P will be rather accurate. For example, suppose the process computing the universe is not optimally efficient for some reason. As long as the resource postulate holds the true prior cannot dominate the Speed Prior, and S-based predictions will be fine. Juergen
Re: Optimal Prediction
Bill Jefferys wrote: At 10:59 AM +0200 4/3/02, Juergen Schmidhuber wrote: The theory of inductive inference is Bayesian, of course. But Bayes' rule by itself does not yield Occam's razor. By itself? No one said it did. Of course assumptions must be made. At minimum one always has to choose priors in Bayesian inference. Our paper shows that there is a Bayesian interpretation that yields something very suggestive of Ockham's razor. It is appealing in that if one has a simple versus a complex hypothesis, simple meaning that the prior probability is concentrated and complex meaning that it is vague and spread out, simple meaning that you don't have many knobs to tweak, complex meaning the opposite, then the simple hypothesis will be favored over the complex one unless the data lie well away from where the simple hypothesis has placed its bets. Bayesians distinguish this from Ockham's formulation by calling it the Bayesian Ockham's razor, recognizing that it is not what William of Ockham wrote, Entia non sunt multiplicanda sine necessitate (or one of his other genuine formulations). Please don't read more into our article than is there. By itself. First you said that the AP by itself has no predictive power. I missed the by itself so misunderstood you, but when I understood what you were saying I agreed. Now you say that Bayes' rule by itself does not yield Ockham's razor. Jim and I never said that it did. I am hard pressed to see how anything nontrivial relating to the real world can be gotten from any principle by itself, so I don't regard these comments as very profound, or very useful. [Remainder of article snipped] Bill One has to choose priors. Exactly. To repeat the nontrivial point: The only choice you need to obtain Occam's razor is to restrict yourself to priors computable in the limit (this is not much of a restriction in the sense that other priors are beyond constructive description anyway). Then you'll automatically prefer few knobs to tweak, or, more generally, short descriptions of the observations: http://www.idsia.ch/~juergen/toesv2/ Juergen (will be out of town until April 16)
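A toy numerical sketch of the Bayesian Ockham's razor Bill describes, with my own illustrative example rather than anything from his paper: the "simple" hypothesis is a coin with no free parameter (bias fixed at 1/2, prior concentrated), the "complex" one has a knob to tweak (bias uniform on (0, 1), prior spread out). The simple hypothesis is favored unless the data land well away from where it placed its bets.

from math import comb

def evidence_simple(heads, n):
    # Marginal likelihood of the no-knob model: bias fixed at 1/2.
    return comb(n, heads) * 0.5 ** n

def evidence_complex(heads, n, grid=1000):
    # Marginal likelihood of the one-knob model: bias uniform on (0, 1), integrated out on a grid.
    total = 0.0
    for i in range(1, grid):
        theta = i / grid
        total += comb(n, heads) * theta ** heads * (1 - theta) ** (n - heads)
    return total / grid

n = 50
for heads in (25, 35, 45):
    bayes_factor = evidence_simple(heads, n) / evidence_complex(heads, n)
    print("%d/%d heads: Bayes factor simple/complex = %.3g" % (heads, n, bayes_factor))

With 25/50 heads the factor is above 1 (simple wins); with 45/50 heads it collapses to nearly zero (the extra knob earns its keep).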
Re: Optimal Prediction
Bill Jefferys wrote: At 9:19 AM +0100 3/27/02, Juergen Schmidhuber wrote: You are claiming the AP necessarily implies a specific fact about nuclear energy levels? I greatly doubt that - can you give a proof? Yes, I can. http://adsabs.harvard.edu/cgi-bin/nph-bib_query?bibcode=1954ApJS1..121Hdb_key=ASThigh=06070 This is the classic paper where Hoyle shows this. It has been cited repeatedly in the literature on the AP as an example of a genuine prediction of the AP. For example, Barrow and Tipler cite it in their book. Being sceptical about proof by citation, I failed to find such a mathematical proof in this paper. Its predictions are based on approximations and assumptions (the authors do acknowledge this) besides the AP. Where exactly do you think is the proof? Predictions that later turn out to be correct demonstrate predictive power of the sum of all assumptions, not of AP by itself. I don't care about flying rabbits. Stick to the issue at hand. This was the issue in the original message. At the risk of beating a dead horse: plain AP does not exclude continuations of your world with different physical laws, as long as your self survives. Who knows - maybe you could live on in a crude simulation that completely ignores the concept of energy levels. To make nontrivial predictions you clearly need more than the AP. The theory of inductive inference provides what is missing: http://www.idsia.ch/~juergen/unilearn.html Juergen
Re: Optimal Prediction
Bill Jefferys wrote: It's pointless wasting my time on this. As both Russell and I pointed out, this is a standard example that is cited by people who are knowledgeable about the AP. Either you have a different definition of predictive power than the rest of us do, or you don't understand Hoyle's very clearly written paper. In either case, it would be foolish of me to continue the thread. Goodbye. plonk Bill Don't try to teach a pig to sing. It wastes your time and annoys the pig. Predictive power is measurable by standard concepts of probability theory and complexity theory. You may choose to ignore this, but don't include all those who don't among the rest of us. Write down all assumptions, derive the consequences, and observe that the AP _by itself_ cannot predict anything nontrivial. Fortunately Hoyle was more careful than some who cite him - he just wrote: the results...were obtained subject to certain assumptions ...have not been demonstrated in a manner free from all doubt. Nevertheless, the number of assumptions made was much less than the number of results obtained. Thus he informally invoked Occam's razor: find short descriptions that explain a lot. Occam's razor is not the AP. It is formally treated by the theory of inductive inference. Although this theory is at the heart of what physicists are doing, some of them are not yet fully aware of it. Juergen (will be out of town for a while)
Optimal Prediction
Many on this list have discussed the anthropic principle. It essentially says that the conditional probability of finding yourself in a universe compatible with your existence equals 1. But unfortunately the anthropic principle does not have any predictive power. It does NOT predict there won't be any flying rabbits tomorrow. Clearly we need more than the anthropic principle to explain the predictability of our universe. In particular, we need an optimal way of predicting the future, given past observations. And there is one! Normally we do not know the true conditional probability distribution p(next event | past). But assume we do know that p is in some set P of distributions. Choose a fixed weight w_q for each q in P such that the w_q add up to 1 (for simplicity, let P be countable). Then construct the Bayesmix M(x) = Sum_q w_q q(x), and predict using M instead of the optimal but unknown p. How wrong is it to do that? The recent exciting work of my postdoc Marcus Hutter (IDSIA) provides general and sharp (!) loss bounds: Let LM(n) and Lp(n) be the total expected losses of the M-predictor and the p-predictor, respectively, for the first n events. Then LM(n)-Lp(n) is at most of the order of sqrt[Lp(n)]. That is, M is not much worse than p. And in general, no other predictor can do better than that! In particular, if p is deterministic, then the M-predictor soon won't make any errors any more. If P contains ALL recursively computable distributions, then M becomes the celebrated enumerable universal prior. That is, after decades of somewhat stagnating research we now have sharp loss bounds for Solomonoff's universal (but incomputable) induction scheme. And if we also include the distributions computable in the limit, we get sharp loss bounds for the even more general priors introduced in Algorithmic Theories of Everything: http://www.idsia.ch/~juergen/toesv2/ Similarly, if we replace M by the Speed Prior S - where S(x) is small if x is hard to compute by any method - we obtain appropriate loss bounds for computable S-based induction: http://www.idsia.ch/~juergen/toesv2/node32.html Alternatively, reduce M to what you get if you just add up weighted estimated future finance data probabilities generated by 1000 commercial stock-market prediction software packages. If only one of them happens to work fine (but you do not know which) you still should get rich. To learn more, please read Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet: ftp://ftp.idsia.ch/pub/techrep/IDSIA-02-02.ps.gz and also check out Hutter's other recent papers at ICML, ECML, NIPS, Int. J. of Foundations of CS: www.idsia.ch/~marcus Juergen Schmidhuber http://www.idsia.ch/~juergen/
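A small self-contained sketch of the Bayesmix above for a toy class P of five Bernoulli coins (the class, the equal weights, the true bias and the sequence length are arbitrary illustration choices). It uses log-loss, for which the regret guarantee is even simpler than the square-root bound quoted above: when the true p is in P, the mixture's excess loss is at most ln(1/w_p), here ln 5, no matter how long the sequence.

import random
from math import log

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]            # the candidate distributions q in P
w = [1.0 / len(thetas)] * len(thetas)         # fixed weights w_q, adding up to 1

def q_prob(theta, bits):
    # q(x): probability that a Bernoulli(theta) source emits the bit string x.
    p = 1.0
    for b in bits:
        p *= theta if b else (1 - theta)
    return p

def M(bits):
    # Bayesmix M(x) = Sum_q w_q q(x).
    return sum(wq * q_prob(theta, bits) for wq, theta in zip(w, thetas))

random.seed(1)
true_theta = 0.7                              # the unknown p happens to be in P
x, loss_M, loss_p = [], 0.0, 0.0
for _ in range(200):
    bit = 1 if random.random() < true_theta else 0
    loss_M += -log(M(x + [bit]) / M(x))       # predictive log-loss of the mixture
    loss_p += -log(true_theta if bit else 1 - true_theta)
    x.append(bit)

print("log-loss of M: %.2f, log-loss of p: %.2f, regret: %.2f (bound ln 5 = %.2f)"
      % (loss_M, loss_p, loss_M - loss_p, log(5)))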
Re: Does provability matter?
Wei Dai wrote: I don't understand how you can believe that the probability of more dominant priors is zero. That implies if I offered you a bet of $1 versus your entire net worth that large scale quantum computation will in fact work, you'd take that bet. Would you really? Your dollar against my 2 cents worth? I don't want to rip you off! Ok, let me adopt a more careful Bayesian's fallback position: I just assume certain things, then derive consequences, given the assumptions, without precisely quantifying the plausibility of the assumptions themselves, letting the reader decide whether they are plausible or not. Juergen
Re: Does provability matter?
Wei Dai wrote: I'm not sure I understand this. Can you give an example of how our universe might make use of an entire continuum of real numbers? How might someone show this if it were true? I have no idea. In fact, I guess it is impossible. But if there is a formally describable prior that dominates the speed prior, and you agree that the more dominant prior doesn't have a prior probability of zero, then isn't the speed prior redundant? Wouldn't you get equal posterior probabilities (up to a constant multiple) by dropping the speed prior from your prior on priors, no matter what it assigns to priors that are not formally describable? In the Bayesian framework we derive consequences of assumptions represented as priors. The stronger the assumptions, the more specific the predictions. The Speed Prior assumption is stronger than the assumption of a formally describable prior. It is not redundant because it yields stronger predictions such as: The computer computing our universe won't compute much more of it; large scale quantum computation won't work; etc. In fact, I do believe the Speed Prior dominates the true prior from which our universe is sampled (which is all I need to make good computable predictions), and that the probability of even more dominant priors is zero indeed. But as a Bayesian I sometimes ignore my beliefs and also derive consequences of more dominant priors. I do find them quite interesting, and others who do not share my belief in the Speed Prior might do so too. Juergen Schmidhuber http://www.idsia.ch/~juergen/ http://www.idsia.ch/~juergen/everything/html.html http://www.idsia.ch/~juergen/toesv2/
Re: Does provability matter?
Wei Dai wrote: On Thu, Nov 15, 2001 at 10:35:58AM +0100, Juergen Schmidhuber wrote: Why do you prefer the Speed Prior? Under the Speed Prior, oracle universes are not just very unlikely, they have probability 0, right? Suppose one day we actually find an oracle for the halting problem, or even just find out that there is more computing power in our universe than is needed to explain our intelligence. Would you then (1) give up the Speed Prior and adopt a more dominant prior, or (2) would you say that you've encountered an extremely unlikely event (i.e. more likely you're hallucinating)? If you answer (1) then why not adopt the more dominant prior now? Why not adopt a more dominant prior now? I just go for the simplest explanation consistent with available data, where my simplicity measure reflects what computer scientists find simple: things that are easy to compute. You didn't explicitly answer my question about what you would do if you saw something that is very unlikely under the Speed Prior but likely under more dominant priors, but it seems that your implicit answer is (1). In that case you must have some kind of prior (in the Bayesian sense) for the Speed Prior vs. more dominant priors. So where does that prior come from? Or let me put it this way. What do you think is the probability that the Great Programmer has no computational resource constraints? If you don't say the probability is zero, then what you should adopt is a weighted average of the Speed Prior and a more dominant prior, but that average itself is a more dominant prior. Do you agree? If someone can use large scale quantum computing to solve a problem whose solution requires much more computation time than necessary to compute the history of our own particular universe, then one could take this as evidence against the Speed Prior. You are right: to quantify this evidence in a Bayesian framework we need a prior on priors. Which one? Hm. Let me extend your question and ask: what's the probability that the Great Programmer is more than a mere programmer in the sense that he is not bound by the limits of computability? For instance, if someone were able to show that our universe somehow makes use of an entire continuum of real numbers we'd be forced to accept some even more dominant prior that is not even computable in the limit. We could not even formally specify it. So what's my prior on all priors? Since the attempt to answer such a question might lead outside what's formally describable, I'll remain silent for now. Juergen Schmidhuber http://www.idsia.ch/~juergen/ http://www.idsia.ch/~juergen/everything/html.html http://www.idsia.ch/~juergen/toesv2/
Re: Does provability matter?
Wei Dai wrote: Thanks for clarifying the provability issue. I think I understand and agree with you. On Tue, Nov 13, 2001 at 12:05:22PM +0100, Juergen Schmidhuber wrote: What about exploitation? Once you suspect you found the PRG you can use it to predict the future. Unfortunately the prediction will take enormous time to stabilize, and you never can be sure it's finished. So it's not very practical. By exploiting the fact that we're in an oracle universe I didn't mean using TMs to predict the oracle outputs. That is certainly impractical. There are a couple of things you could do though. One is to use some oracle outputs to predict other oracle outputs when the relationship between them is computable. The other, much more important, is to quickly solve arbitrarily hard computational problem using the oracles. I prefer the additional resource assumptions reflected by the Speed Prior. They make the oracle universes very unlikely, and yield computable predictions. Why do you prefer the Speed Prior? Under the Speed Prior, oracle universes are not just very unlikely, they have probability 0, right? Suppose one day we actually find an oracle for the halting problem, or even just find out that there is more computing power in our universe than is needed to explain our intelligence. Would you then (1) give up the Speed Prior and adopt a more dominant prior, or (2) would you say that you've encountered an extremely unlikely event (i.e. more likely you're hallucinating)? If you answer (1) then why not adopt the more dominant prior now? You are right in the sense that under the Speed Prior any complete _infinite_ oracle universe has probability 0. On the other hand, any _finite_ beginning of an oracle universe has nonvanishing probability. Why? The fastest algorithm for computing all universes computes _all_ finite beginnings of all universes. Now oracle universes occasionally require effectively random bits. History beginnings that require n such bits come out very slowly: the computation of the n-th oracle bit requires more than O(2^n) steps, even by the fastest algorithm. Therefore, under the Speed Prior the probabilities of oracle-based prefixes quickly converge towards zero with growing prefix size. But in the finite case there always will remain some tiny epsilon. Why not adopt a more dominant prior now? I just go for the simplest explanation consistent with available data, where my simplicity measure reflects what computer scientists find simple: things that are easy to compute. Juergen Schmidhuber http://www.idsia.ch/~juergen/ http://www.idsia.ch/~juergen/everything/html.html http://www.idsia.ch/~juergen/toesv2/
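A back-of-envelope illustration of the last point, with made-up numbers and a deliberately crude stand-in for the Speed Prior (the real definition is in the papers linked above): if a prefix with n oracle bits needs on the order of 2^n steps, and a time-penalizing prior charges roughly a factor 1/t for t steps of computation on top of the 2^-L cost of an L-bit program, then the weight of such prefixes collapses geometrically in n while never quite reaching zero.

L = 100                          # hypothetical length of the short oracle-universe program
for n in (1, 5, 10, 20, 40):
    t = 2 ** n                   # the n-th oracle bit needs more than about 2^n steps
    weight = 2.0 ** -L / t       # crude time-penalized weight, NOT the formal Speed Prior
    print("prefix with %2d oracle bits: weight on the order of 2^-%d = %.1e" % (n, L + n, weight))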
Re: Does provability matter?
Wei Dai wrote: On Wed, Oct 31, 2001 at 10:49:41AM +0100, Juergen Schmidhuber wrote: Which are the logically possible universes? Tegmark mentioned a somewhat vaguely defined set of ``self-consistent mathematical structures,'' implying provability of some sort. The postings of Marchal also focus on what's provable and what's not. Do you understand Marchal's postings? I really have trouble making sense of them. If you do, would you mind rephrasing his main ideas for me? No, unfortunately I do not understand him. But provability does seem important to him. Maybe someone else could post a brief and concise summary? Is provability really relevant? Philosophers and physicists find it sexy for its Goedelian limits. But what does this have to do with the set of possible universes? What does it mean for a universe to be provable? One interpretation is that there exists an algorithm to compute any finite time-space region of the universe that always halts. Is that what you mean? Well, any finite object x has a halting algorithm that computes it. So which are the unprovable things? They are formal statements such as algorithm x will eventually compute object y. If it does then you can prove it, otherwise in general you can never be sure. But why identify universes with true or false statements? When I'm asking Is provability really relevant? I am not really referring to provable universes - I don't even know what that is supposed to be. I just mean: why should it be important to show that a particular algorithm computes a particular universe? Or that a particular universe has a particular property (there are many generally unprovable properties)? Let's compute them all - why worry about provability? The provability discussion seems to distract from the real issue. If we limit ourselves to universes corresponding to traditionally provable theorems then we will miss out on many formally and constructively describable universes that are computable in the limit yet in a certain sense soaked with unprovability. Example: a never ending universe history h is computed by a finite nonhalting program p. To simulate randomness and noise etc, p invokes a short pseudorandom generator subroutine q which also never halts. The n-th pseudorandom event of history h is based on q's n-th output bit q(n) which is initialized by 0 and set to 1 as soon as the n-th statement in an ordered list of all possible statements of a formal axiomatic system is proven by a theorem prover that systematically computes all provable theorems. Whenever q modifies some q(n) that was already used in the previous computation of h, p appropriately recomputes h since the n-th pseudorandom event. Living in this universe is like living in a universe with a halting-problem oracle, right? If you want to know whether program x halts, you just measure the n-th pseudorandom event, where n is such that the n-th theorem is x halts. Yes? Yes. Once the history has stabilized this will give you the correct answer. The problem with universes with halting-problem oracles (which I'll just call oracle universes) is that either the oracle is not exploitable (for example if no one in the universe figures out that the pseudorandom events actually correspond to theorems) in which case the universe just looks like an ordinary provable universe, or the oracle is exploitable, in which case the outcome becomes completely unimaginable for us. (Or is that not true? 
Can we imagine what would happen if an intelligent species came across an exploitable halting-problem oracle?) But why should this lack of provability matter? Ignoring this universe just implies loss of generality. Provability is not the issue. If oracle universes could exist, how significant are they? What's the probability that we live in such a universe, and how could we try to determine and/or exploit this fact? If the computer on which the history of our universe is calculated is a general Turing machine (as opposed to a monotone machine which cannot edit previous outputs), and if we make no resource assumptions such as those embodied by the Speed Prior, then the oracle universes indeed represent a large fraction of the probability mass of all universes. Why? Because there are many oracle universes with very short algorithms. How to determine whether we are in an oracle universe? In principle it is possible, at least to the extent inductive inference is possible at all! Just search through all oracle pseudo-random generators (PRGs) and try to match their outputs with the apparently noisy observations. If the observations are not really random, eventually you'll stumble across the right PRG. It's just that you cannot prove that it is the right one, or that its outputs have already stabilized. What about exploitation? Once you suspect you found the PRG you can use it to predict the future. Unfortunately the prediction
Does provability matter?
Cannibalizing previous thread Provable vs Computable: Which are the logically possible universes? Tegmark mentioned a somewhat vaguely defined set of ``self-consistent mathematical structures,'' implying provability of some sort. The postings of Marchal also focus on what's provable and what's not. Is provability really relevant? Philosophers and physicists find it sexy for its Goedelian limits. But what does this have to do with the set of possible universes? The provability discussion seems to distract from the real issue. If we limit ourselves to universes corresponding to traditionally provable theorems then we will miss out on many formally and constructively describable universes that are computable in the limit yet in a certain sense soaked with unprovability. Example: a never ending universe history h is computed by a finite nonhalting program p. To simulate randomness and noise etc, p invokes a short pseudorandom generator subroutine q which also never halts. The n-th pseudorandom event of history h is based on q's n-th output bit q(n) which is initialized by 0 and set to 1 as soon as the n-th statement in an ordered list of all possible statements of a formal axiomatic system is proven by a theorem prover that systematically computes all provable theorems. Whenever q modifies some q(n) that was already used in the previous computation of h, p appropriately recomputes h since the n-th pseudorandom event. Such a virtual reality or universe is perfectly well-defined. We can program it. At some point each history prefix will remain stable forever. Even if we know p and q, however, in general we will never know for sure whether some q(n) that is still zero won't flip to 1 at some point, because of Goedel etc. So this universe features lots of unprovable aspects. But why should this lack of provability matter? Ignoring this universe just implies loss of generality. Provability is not the issue. Juergen Schmidhuber http://www.idsia.ch/~juergen/ http://www.idsia.ch/~juergen/everything/html.html http://www.idsia.ch/~juergen/toesv2/
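A sketch of the control flow of the example above, with both the theorem prover and the universe replaced by trivial stubs; the stub "prover" rule, the toy universe update, and the step counts are illustrative assumptions only. The real p would run a prover that enumerates all provable theorems of the chosen axiomatic system and would recompute the actual physics from the affected event onward.

def stub_prover_step(step):
    # Stand-in for a theorem prover: returns the index of a statement "proven" at this
    # step, or None. A real p would enumerate all provable theorems of a formal system.
    return step // 7 if step % 7 == 0 else None

def recompute_history(q_bits, length):
    # Toy universe: the t-th pseudorandom event is simply q(t). A real p would rerun the
    # actual history from the n-th event onward, reusing q wherever noise is needed.
    return [q_bits.get(t, 0) for t in range(1, length + 1)]

q_bits = {}                                       # q(n), all initialized to 0
history = []
for step in range(1, 60):
    history = recompute_history(q_bits, step)     # p extends h by one more event
    n = stub_prover_step(step)
    if n is not None and q_bits.get(n, 0) == 0:
        q_bits[n] = 1                             # the n-th statement got proven: flip q(n)
        history = recompute_history(q_bits, len(history))   # and recompute h from there on
print("".join(str(b) for b in history))

In the sketch the stub makes every flip predictable; with a real theorem prover, whether a given q(n) that is still zero will ever flip is in general undecidable, which is exactly the unprovability the example is meant to exhibit.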
Re: Predictions duplications
From: Russell Standish [EMAIL PROTECTED] The only reason for not accepting the simplest thing is if it can be shown to be logically inconsistent. This far, you have shown no such thing, but rather demonstrated an enormous confusion between measure and probability distribution. Russell, this is really too much; please read any intro to measure theory, e.g., the one by Halmos. Measures in general are _not_ probability distributions. They are more than that; you can view them as _sets_ of probability distributions on sets that are related to each other in a specific way. Probability density functions define measures and therefore in general aren't probability distributions either. The uniform PDF on [0..1] yields the perfect trivial example; that's exactly why I used it before. In the computability context, compare Li/Vitanyi 1997, chapters 4.3 (discrete sample space with (semi)distribution m(x)) vs chapter 4.5 (continuous sample space with (semi)measure M(x)). This is an excellent reference, thank you. Example 4.2.1 gives the construction of the uniform measure over the set of strings, precisely what you claim didn't exist when we started this discussion. You just want to annoy me, don't you? For the last time: There is a uniform *measure* on the strings, but no uniform probability distribution. 1) Halting theorem implies the set of halting programs is of measure zero within the set of all programs You mean, given a uniform measure? But why should you need the halting theorem for this trivial statement? In fact, what exactly do you mean by halting theorem? (Schoenfield's book provides a good intro to mathematical logic and computability theory.) The halting theorem is that there is no algorithm for predicting whether any program will halt. Of course, the result I was referring to is that the set of compressible strings is of measure zero. The set of halting programs actually has finite measure, since \Omega > 0. However, there is still finite probability of an input string being nonhalting, and indeed the probability seems to be greater than 0.7 (ie \Omega < 0.217643 - see pg 218 Li and Vitanyi), so this still presents a problem for computationalism. This is mixing up everything, in particular, different measures. 1) Infinite strings with finite program have *nonzero* measure, given the *universal* measure (Omega is a number derived from the universal measure). 2) They have measure zero, given the *uniform* measure (Omega has nothing to do with it). 3) To see 2) we do not need the halting theorem - we just compare the finite programs (countably many) to all strings (uncountably many) and we are done. 4) The halting theorem is no problem whatsoever when it comes to computable universes. Why care whether some computable universe stops or not? If it does, it does, otherwise it does not. In both cases we can compute it in the limit. The GP program (aka Universal Dovetailer) is not the simplest thing. The simplest describable thing is the set of all descriptions, that simply exist without being generated. That has precisely zero information or complexity, whereas the UD has some complexity of the order of a few 100 bits. The simplest prior is the uniform one, ie there is no bias whatsoever in the selection. Exist without being generated? Precisely zero information or complexity? But with respect to what? With respect to the observer. With respect to anything in fact.
The set of all possible descriptions has precisely zero information, by definition, regardless of what observer or context you employ. This is a vague informal statement, not a definition. Which are the possible descriptions? What's the formal difference between a description and a non-description? What is your anything if there is no device for describing it formally? If anything does not convey any bit of info then what exactly is it that conveys one bit? Or two? Any traditional definition of information/simplicity requires the specification of a formal description method, such as a Turing machine. You don't need one? Then how do you define information or complexity? How exactly do you define simple ? Actually, all it needs is an equivalencing procedure. See my On Complexity and Emergence paper. A Turing machine will do the job for you, but so will a neural network or an expert system following heuristic rules. So you have to write down formally this equivalencing procedure, right? Why should this be different in principle from writing down the formal rules for a Turing machine? Why is a simplicity measure that depends on your equivalencing procedure superior to a simplicity measure depending on a Turing machine? (The TM procedure includes yours - because it includes all procedures - but not the other way round.)
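To make the measure-versus-distribution distinction in the first paragraph concrete, here is a minimal sketch (the particular numbers are only illustrative): the uniform measure assigns 2^-n to the set of all infinite strings extending a given length-n prefix, and these assignments fit together consistently; but no probability distribution can give the same positive probability to each element of a countably infinite collection of individual objects, such as all finite strings or all programs.

def uniform_measure(prefix):
    # Measure of the *set* of all infinite continuations of this finite prefix.
    return 2.0 ** -len(prefix)

x = "0110"
# Consistency condition that makes it a measure: mu(x) = mu(x0) + mu(x1).
assert uniform_measure(x) == uniform_measure(x + "0") + uniform_measure(x + "1")

# No uniform probability distribution on countably many individual objects:
# equal probability eps > 0 for each makes the partial sums grow without bound,
# while eps = 0 for each sums to 0 rather than 1.
eps = 1e-9
for count in (10 ** 9, 10 ** 10, 10 ** 11):
    print("first %d objects already carry total probability %.1f" % (count, eps * count))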
Re: Predictions duplications
From: Juho Pennanen [EMAIL PROTECTED] So there may be no 'uniform probability distribution' on the set of all strings, but there is the natural probability measure, that is in many cases exactly as useful. Sure, I agree, measures are useful; I'm using them all the time. But in general they are _not_ the same thing as probability distributions. From: Russell Standish [EMAIL PROTECTED] The only reason for not accepting the simplest thing is if it can be shown to be logically inconsistent. This far, you have shown no such thing, but rather demonstrated an enormous confusion between measure and probability distribution. Russell, this is really too much; please read any intro to measure theory, e.g., the one by Halmos. Measures in general are _not_ probability distributions. They are more than that; you can view them as _sets_ of probability distributions on sets that are related to each other in a specific way. Probability density functions define measures and therefore in general aren't probability distributions either. The uniform PDF on [0..1] yields the perfect trivial example; that's exactly why I used it before. In the computability context, compare LI/Vitanyi 1997, chapters 4.3 (discrete sample space with (semi)distribution m(x)) vs chapter 4.5 (continuous sample space with (semi)measure M(x)). 1) Halting theorem implies the set of halting programs is of measure zero within the set of all programs You mean, given a uniform measure? But why should you need the halting theorem for this trivial statement? In fact, what exactly do you mean by halting theorem? (Schoenfield's book provides a good intro to mathematical logic and computability theory.) The GP program (aka Universal Dovetailer) is not the simplest thing. The simplest describable thing is the set of all descriptions, that simply exist without being generated. That has precisely zero information or complexity, whereas the UD has some complexity of the order of a few 100 bits. The simplest prior is the uniform one, ie there is no bias whatsoever in the selection. Exist without being generated? Precisely zero information or complexity? But with respect to what? Any traditional definition of information/simplicity requires the specification of a formal description method, such as a Turing machine. You don't need one? Then how do you define information or complexity? How exactly do you define simple ?
Re: Predictions duplications
Schmidhuber: It's the simplest thing, given this use of mathematical language we have agreed upon. But here the power of the formal approach ends - unspeakable things remain unspoken. Marchal: I disagree. I would even say that it is here that the serious formal approach begins. Take unprovable for unspeakable, we can meta-reason (informally or formally) and study the logical structure of the set of unprovable sentences by sound universal machines. Schmidhuber: Unprovable does not mean unspeakable! You can spell out many unprovable things. Why care for the subset of provable sentences? Aren't we interested in the full set of all describable sentences? We can generate it, without caring for proofs at all. Unspeakable means beyond formal definition. In particular, the way we humans communicate and agree on a common formal language is not formally defined. You write x and I say, ok, it's an x, but that's all based on some sort of informal agreement that involves complex pattern recognition and learning processes. Usually we do not question the validity of outcomes of such cognitive processes, and just go ahead with formal derivations. But there is this major informal step *before* formal reasoning can start, and good textbooks on logic acknowledge this.
Re: everything priors
Hi Max, 2) If so, should we really limit ourself to this particular kind of mathematical structures? My concern is that we may be a bit too narrow-minded if we do. But this sort of narrow-mindedness seems necessary to remain within the formally describable realm. I'd go beyond computable structures only if forced by evidence, e.g., if someone shows our universe somehow won't run without all the real numbers. But for now there isn't any evidence in this vein. Wait a sec: there's also no evidence that our particular universe has seven spatial dimensions or a proton/electron mass ratio different from 1836. But we're considering the whole ensemble here. Right. Our motivation is that the ensemble is compatible with existing data yet simpler than the special case. But Occam's razor also motivates us to consider just the computable universes as opposed to those based on, say, non-computable real numbers. The ensemble of computable universes is both compatible with known data AND simple. The ensemble of non-computable universes is not! Cheers, Juergen
Re: Turing vs math
Step n owns 2^(n-1) initial segments. Bruno, why are we discussing this? Sure, in finite time you can compute all initial segments of size n. In countable time you can compute one real, or a countable number of reals. But each of your steps needs more than twice the time required by the previous step. Therefore you need more than countable time to compute all reals. Now, could you give me a bitstring which is not generated by this countably infinite process ? Sure. The output of the program While TRUE print 1 won't be computed in countable time by your procedure. Juergen
Re: Turing vs math
Why assume non-computable stuff without compelling reason? Shaved by Occam's razor. Jacques: On the contrary. Why assume the lack of *any* given type of mathematical structure? A true everything-hypothesis surely would not. Occam's razor says: don't add extra distinctions such as a restriction like that. Note also that, as I said, computability isn't the real issue. A Turing machine cannot be a continuous (but computable) structure. Of course the non-computable structures should exist too in an everything-hypothesis. The non-computable structures are just an ill-defined figment of our imagination. They do not exist in the sense that we cannot formally describe them with a finite number of bits. Textbooks and theorems about real numbers are computable (finite symbol strings, finite proofs), most real numbers are not. Occam's razor really says: do not add any bits beyond those necessary to explain the data. Observed data does not require more than a finite number of bits, and never will. Non-computability is not a restriction. It is an unnecessary extension that greatly complicates things, so much that we cannot even talk about it in a formal way. Juergen Schmidhuber www.idsia.ch
Re: Turing vs math
Jacques Mallah wrote: I agree with Alistair Malcolm that Turing machines are not enough A continuous structure is a perfectly good mathematical structure, but no Turing based scheme can include it. Why assume non-computable stuff without compelling reason? Shaved by Occam's razor. Juergen Juergen Schmidhuber www.idsia.ch
Re: Turing vs math
Bruno: Honestly it is still not clear. How could ever S(U)=TRUE be computable? As a computer scientist I guess you know that even the apparently simple question "does that piece of code compute the factorial function" is not computable. Sure, it's not even decidable in general whether a given program will produce any output at all. Once the output is there, though, you can easily decide whether it consists of, say, prime numbers. In absence of evidence to the contrary we assume that the presence of your consciousness (whatever that may be) is detectable by a computable process, namely, the one employed by you, the observer, who decides that he exists, via some sort of computation taking place as part of the evolution of his universe. George: Juergen seems to be talking as if the measure of information is absolute. It is not. Information is always conditional information (Shannon), where the condition is the observer's a-priori database of information. Sure, but in the Kolmogorov framework the role of prior knowledge vanishes in the limit. Each prior corresponds to a particular Turing machine TM with particular inbuilt prior knowledge. But TM2 can simulate TM1 using a compiler whose size is independent of the information content of the universes they compute. For all U, K1(U) = O(K2(U)), where K1(U) and K2(U) are U's information contents relative to TM1 and TM2. Similarly P1(U) = O(P2(U)), where P1 and P2 are the respective probabilities given Solomonoff-Levin distributions for TM1 and TM2. There is a good reason why Solomonoff-Levin distributions are called universal distributions! Hal: We then invoke the principle that large-program universes are inherently less likely than small-program universes, and presto! we have it more likely that we live in a universe without flying rabbits, without magic, etc. That's the general argument we are striving to achieve. I agree. That's precisely one of the major points of the 1997 paper. I do think that this argument has some problems, but it is appealing and if the holes can be filled it seems to offer an answer to the question. What do you think? Where exactly are the holes? Juergen
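A trivial sketch of the distinction just made, with primality as the example property from the text: whether a program ever produces output is undecidable in general, but once a finite output is on the table, checking a decidable property of it is a plain terminating computation.

def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def consists_of_primes(output):
    # Decidable check on output that has already been produced.
    return all(is_prime(n) for n in output)

print(consists_of_primes([2, 3, 5, 7, 11]))   # True
print(consists_of_primes([2, 3, 4]))          # False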
Re: Turing vs math
Bruno wrote: I don't take the notion of observer for granted. Neither do I, of course. The observer O is something computable that evolves in some universe U. The problem is that to be in a universe has no clear meaning But it does. There is a computable predicate S such that S(U)=TRUE if O inhabits U. Fortunately, there is no need for us to further specify what it means to ``inhabit,'' to be ``conscious'' or ``self-aware,'' or whether there is some other observer who applies S to U and uses S(U) to identify O, etc. Now we have a formally defined, conditional probability distribution on all universes satisfying S. I thought this to be clear, but maybe I should have written it down explicitly. Juergen