SOLOMONOFF MACHINES UP CLOSE AND PERSONAL

This is in response to Shane Legg's post of Fri 11/9/2007 9:40 AM and Lukasz Stafiniak's posts of Fri 11/9/2007 7:13 AM and Fri 11/9/2007 12:09 PM.
==Preface====================

Here is a list of questions I have from reading these very helpful posts. I have placed most of my questions under language copied from Lukasz's two posts, because their question-and-answer nature provided more heading-like quotes, and because many of the questions I had in response to Shane's post fit under such headings.

In what follows, "ED######>" preceded by ">>" indicates a question or statement from me that Lukasz was responding to. A divider such as "==Q1========================" precedes each portion of text quoted from one of Lukasz's or Shane's posts, or each portion that starts a different area of questioning by me. These dividers can be used to indicate the subject matter you are addressing if you choose to respond to only one or a few of them at a time.

Shane, I would be interested in reading your PhD thesis. (But I don't know how much of it I will be able to understand.)

Don't feel obligated to answer all of this at once, or even ever. But I would appreciate as many answers as possible.

==Q1========================
>Lukasz######> Kolmogorov complexity is "universal". For probabilities, you need to specify the probability space and initial distribution over this space.

ED######> I understand the importance of this: i.e., there are many different types of distributions -- binary, one-of-n, integer-in-range [x,y], complex-in-range [x+ai, y+bj], vector-in-n-dimensional-space, distributions-of-distributions, etc.

So is the real significance of the universal prior not its probability value in a given probability space (which seems relatively unimportant, provided it is not one or close to zero), but rather the fact that it can model almost any kind of probability space? How does Kolmogorov complexity help deal with this problem? Is it because, since its major form of representation is a program on a Turing-equivalent machine, it has the capacity to represent virtually any type of space that can be represented by a computer? It seems to me that just a system that could remember significant portions of its sensory inputs and machine states, and that could match against them (such as Novamente), could have something approaching the same generality for a given degree of computing power, minus the, at least to me, not totally well understood potential magic of program-based representation.

But how does a Solomonoff machine in the real world decide what type of probability space should be used for a given distribution? And how does it decide what is to be treated as a variable? It seems to me that before it selects a probability space for a variable it first has to decide what it considers a variable to be (other than when dealing with primitive sensory inputs). That is, you can't develop a model of variation unless you know what it is that is varying. It is my feeling that any generalization is a variable, because it can have multiple instantiations, so it seems to me that you have to build up generalizations before you can have variables. For example, the set of integers is a set defined by a generalization about countable values. It seems to me that the variations among instances of a generalization define possible dimensions for the associated probability space. I assume the Solomonoff machine defines generalizations as covering patterns that can be produced by similar programs, and one such measure of similarity is the Normalized Compression Distance that Shane turned me on to.

Presumably some measure of importance is involved in deciding which patterns are important, unless you are in the la-la land of unlimited computational resources.

How are probabilities represented? I assume that once a space has been picked you just use normal Bayesian probabilities with the Solomonoff prior (which again seems much less important than the picking of the space, provided it is not a clearly stupid value).
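To make the "How are probabilities represented?" question concrete, here is a toy sketch of how I currently picture the prediction side working. This is only my own illustration: the "machine" below is a deliberately tiny, non-universal program class (a program is just a bit string, and its output is that string repeated forever), since the real Solomonoff mixture runs over all programs for a universal Turing machine and is incomputable. Each program of length L gets prior weight 2^-L, and the prediction is the ratio of mixture masses.

```python
# Toy sketch of Solomonoff-style prediction (my own illustration).
# The "machine" here is a deliberately tiny, NON-universal program class:
# a program is just a bit string s, and its "output" is s repeated forever.
# Each program of length L gets prior weight 2**-L, as in the Solomonoff mixture.

from itertools import product

def output(program, n):
    """First n bits produced by the toy program (the string repeated forever)."""
    return [program[i % len(program)] for i in range(n)]

def mixture_mass(observed, max_len=10):
    """Sum of 2**-L over all toy programs (up to max_len bits) whose output
    begins with the observed bits -- a crude stand-in for M(observed)."""
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product((0, 1), repeat=length):
            if output(bits, len(observed)) == list(observed):
                total += 2.0 ** -length
    return total

def prob_next_is_one(observed, max_len=10):
    """Predictive probability that the next bit is 1, given the observed bits."""
    m1 = mixture_mass(list(observed) + [1], max_len)
    m0 = mixture_mass(list(observed) + [0], max_len)
    return m1 / (m0 + m1)

# After seeing 101010, the short program "10" dominates the mixture,
# so the machine predicts the next bit is 1 with probability well above 0.5.
print(prob_next_is_one([1, 0, 1, 0, 1, 0]))
```

If I have this right, it also suggests why the choice of program class (the "space") matters so much more than the particular prior values: the 2^-L weighting only becomes useful once the class of programs is rich enough to capture the regularity in question.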
==Q2========================
>> ED######> So are the programs just used for computing Kolmogorov complexity, or are they also used for generating and matching patterns?

>Lukasz######> It is difficult to say: in AIXI, the direct operation is governed by the expectimax algorithm, but the algorithm works "in future" (is derived from the Solomonoff predictor). Hutter mentions an alternative model, AIXI_alt, which models actions the same way as the environment...

ED######> ??Shane??, what are the major ways programs are used in a Solomonoff machine? Are they used for generating and matching patterns? Are they used for creating context-specific instantiations of behavioral patterns?

Is the expectimax algorithm not just a standard serial algorithm for reinforcement learning, except that it has (a) a recording of its entire history, and (b) the ability to search all possible sequences of possible actions based, presumably, on perfect induction of probabilities from that history? How would it be wedded to the Solomonoff machine?

==Q3========================
>Lukasz######> It is automatic: when you have a program with a good enough match, then you can "parameterize" it over the difference and apply it twice, thus saving the code.

ED######> What do you mean by "parameterize" it? Do you mean (a) tweak parameters in the code so that it changes the program's manifestation, (b) that extra or different code is added to make up the differences, (c) tune it to accept less-than-totally-correct matches, or (d) all of the above?

==Q4========================
>Lukasz######> Remember that the programs need to represent the whole history.

ED######> Do you mean that the library of all the programs must be able to represent the whole history? Or do you also mean that there is one program that represents the whole history?

==Q5========================
>>ED######> Can the programs learn that similar but different patterns are different views of the same thing? Can they learn a generalizational and compositional hierarchy of patterns?

>Lukasz######> With an egzegetic enough interpretation...

ED######> Re learning different views of similar things: it is generally believed that in humans much of the learning of the similarity of differing views comes from experiencing spatial and temporal continuity across sequences of view transformations. If your system is any good at generalized pattern learning, it should be able to learn a generality for things that have temporal and spatial continuity, and the importance of that generality. It is one of the key characteristics of most physical objects.

Re learning generalizational and compositional pattern hierarchies: I, Novamente, and I assume a large number of other people have some pretty good ideas about how to do that.

Re "egzegetic": there was only one, repeat one, hit for "egzegetic" on Google, and I think it was a reference to another paper. So what does it mean -- exegetic? Do you mean: if the interpretation were good enough at drawing meaning out of the machine's history?
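Re Q3, to check whether I understand "parameterize it over the difference," here is a toy illustration of my own (not anything taken from Hutter's or Shane's papers) of how factoring two similar patterns into one shared, parameterized routine plus two short parameter lists can shorten the total description:

```python
# Toy illustration (my own) of "parameterizing over the difference":
# two similar patterns are written down (a) literally, and (b) as one shared,
# parameterized routine plus two small parameter tuples.  The shared routine
# is paid for only once, so the total description gets shorter.

def ramp(start, step, length):
    """Shared, parameterized sub-program: an arithmetic ramp."""
    return [start + step * i for i in range(length)]

pattern_a = ramp(0, 3, 20)   # 0, 3, 6, ... 57
pattern_b = ramp(5, 3, 20)   # 5, 8, 11, ... 62  (same ramp, different start)

# Very crude stand-in for description length: characters needed to write it down.
literal_cost = len(repr(pattern_a)) + len(repr(pattern_b))
shared_cost = (len("def ramp(s,d,n): return [s+d*i for i in range(n)]")
               + len("ramp(0,3,20)") + len("ramp(5,3,20)"))

print(literal_cost, shared_cost)   # the shared, parameterized form is smaller
```

Whether that is the kind of code saving Lukasz means is exactly what I am asking.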
==Q6========================
>Lukasz######> these programs will be compact descriptions of data when enough data gets collected, so their (posterior) probability will grow with time. But the most probable programs will be very cryptic, without redundancy to make the structure evident.

ED######> I don't understand how probabilities are used in a Solomonoff machine, but I assume it is as I said under Q1 above.

==Q7========================
>Lukasz######> The programs do not compute K complexity, they (their length) _are_ (a variant of) Kolmogorov complexity. The programs compute (predict) the environment.

ED######> Yes, understood, about the length of programs being the complexity. Does the machine also learn by predicting and comparing predictions to outcomes?

Why would the more frequent programs, or patterns, become more compactly represented? Do you mean that a more efficient representation would be made for a repeating pattern because (a) more time would be spent trying to compress it, or (b) the plurality of repeated patterns corresponding to a given generalization would be more compactly represented because there would be a common generalization that could, with a little tweaking, be used to represent all of them?

==Q8========================
>Lukasz######> The programs are generally required to exactly match in AIXI (but not in AIXItl I think).

ED######> ??Shane??, could you please give us an assist on this one? Is exact matching required? And if so, is this something that could be loosened in a real machine?

==Q9========================
>>ED######> ...does it know when a match is good enough that it can be relied upon as having some significance?

>Lukasz######> ...the "significance" is provided by the compression on representation of similar things, which favors the same sort of similarity in the future.

ED######> Why does compression favor the representation of similar things? Is Kolmogorov complexity used to determine probability for purposes of inference, or is that done by determining frequencies of occurrence of similar patterns using Bayesian probabilities?

==Q10========================
>>ED######> Can they run on massively parallel processing?

>Lukasz######> I think they can... In AIXI, you would build a summation tree for the posterior probability.

ED######> Are you referring to a summation tree that would sum over the occurrences of similar patterns at different epochs in past history, to determine how often they have occurred, and how often they have occurred within similar larger patterns?

==Q11========================
>Lukasz######> To be optimal, the expectimax must be performed chronologically from the end of the horizon (dynamic programming principle: close to the end of the time horizon, you have smaller planning problems -- fewer opportunities; from the solutions to smaller problems you build bigger solutions backwards in time).

ED######> Is the horizon (a) the present, or (b) as far in the future as the machine is going to think? I presume it is (b). Can the expectimax algorithm both forward and backward chain? What control does expectimax have over its own behavior, such as switching between forward and backward chaining, or focusing attention, or does it avoid all such issues by claiming it will compute everything in all ways?
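So that my questions in Q2 and Q11 are concrete, here is how I currently picture the expectimax recursion. This is a minimal sketch of my own: env_prob and reward are made-up stand-ins for whatever predictive mixture and reward signal the real machine uses, and the recursion is exponential in the horizon even before you worry about the incomputable Solomonoff mixture inside it.

```python
# Minimal sketch of the expectimax recursion as I picture it (see Q2 and Q11).
# env_prob(history, action, percept) and reward(percept) are made-up stand-ins
# for the agent's predictive mixture and reward signal; in real AIXI the
# mixture runs over all programs and the whole thing is incomputable.

def expectimax(history, horizon, actions, percepts, env_prob, reward):
    """Return (best_value, best_action) looking `horizon` steps ahead.
    The recursion bottoms out at horizon 0, so the smallest planning problems
    (those nearest the end of the horizon) are solved first and combined
    backwards in time, which I take to be Lukasz's dynamic programming point."""
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        value = 0.0
        for x in percepts:
            p = env_prob(history, a, x)            # predicted probability of percept x
            if p == 0.0:
                continue
            future_value, _ = expectimax(history + [(a, x)], horizon - 1,
                                         actions, percepts, env_prob, reward)
            value += p * (reward(x) + future_value)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

# Tiny made-up example: two actions, two percepts, a uniform predictive model.
uniform = lambda history, a, x: 0.5
reward = lambda x: 1.0 if x == "good" else 0.0
print(expectimax([], 3, ["left", "right"], ["good", "bad"], uniform, reward))
```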
==Q12========================
>> ED######> Are these short codes sort of like Wolfram's little codelettes, that can hopefully represent complex patterns out of very little code, or do they pretty much represent subsets of visual patterns as small bit maps?

>Lukasz######> It depends on reality, whether the reality supports Wolfram's hypothesis.

ED######> By Wolfram codelettes I did not mean Game of Life codelets; I meant compact programs that can represent a lot with a little.

In one of his published papers Ben Goertzel said that MOSES, which is an evolutionary programming component of Novamente, is good at learning modest-size programs, but is not good at learning a big program from scratch, because the search space explodes as problem complexity does. I have no feeling for what size problems it can attack, at what speed, with what hardware. But MOSES is capable of compositional learning over time, so that after it learns a lot of little codelettes it can then use them as pieces in bigger codelettes.

I have no idea how efficient it is to find computer programs that can represent patterns much more efficiently than a literal recording (other than by using standard data compression techniques) -- other than by building up a gen/comp hierarchy of patterns rather literally derived from the patterns below them, as the brain does. This is what Jeff Hawkins calls hierarchical memory, a good example of which is described, with regard to the visual "what" system, in the Serre paper I have cited so many times before (http://cbcl.mit.edu/projects/cbcl/publications/ps/MIT-CSAIL-TR-2006-028.pdf).

Isn't there a large similarity between a Solomonoff machine that could learn a hierarchy of pattern-representing programs and Jeff Hawkins's hierarchical learning (as represented in the Serre paper)? One could consider the patterns at each level of the hierarchy as subroutines. The system is designed to increase its representational efficiency by having representational subroutines available for use by multiple different patterns at higher compositional levels. To the extent that a MOSES-type evolutionary system could be set to work making such representations more compact, it would become clearer how semi-Solomonoff machines could be made to work in the practical world.

==Q13========================
ED######> How is pattern matching done in a Solomonoff machine? Are all the pattern-representing programs run against the current or past input to be analyzed? Is parallelism used, and if so, could it not be used in a manner similar to the pattern matching in the Serre paper?

==Q14========================
ED######> How does a Solomonoff machine take into account the difference between complexity of occurrence and complexity of observation? I assume it could have different pattern-modeling components that model these different types of complexity, which would then allow it to factor out the complexities of observation when trying to calculate the complexities of occurrence. Shane said a machine trying to interpret a stream of video would have to be trained up on video input to be able to find structure in the world observed by such video.

==Q15========================
ED######> Is a Solomonoff machine simply a machine that uses Solomonoff induction? Or, as the above discussion suggests, is it much more? Is there any commonly understood definition of "Solomonoff machine" and, if so, what is it? The definitions of Solomonoff induction on the web, and even in Shane Legg's paper on Solomonoff induction, make it sound like it is merely Bayesian induction with priors picked on the basis of Kolmogorov complexity. But statements made by Shane and Lukasz appear to imply that a Solomonoff machine uses programs and program size as a tool for pattern representation, generalization, learning, inference, and more.

==Q16========================
>Shane######> ...our prior knowledge of the world strongly biases how we interpret new information. So, to use your example, we all know that people living on a small isolated island are probably genetically quite similar to each other. Thus, if we see that one has brown skin, we will guess that the others probably do also. However, weight is not so closely tied to genetics, and so if one is obese then this does not tell us much about how much other islanders weigh. Out-of-the-box a Solomonoff machine doesn't know anything about genetics and weight, so it can't make such inferences based on seeing just one islander. However, if it did have prior experience with genetics etc., then it too would generalise as you describe using context. Perhaps the best place to understand the theory of this is section 8.3 from "An Introduction to Kolmogorov Complexity and Its Applications" by Li and Vitanyi. You can also find some approximations to this theory that have been applied in practice to many diverse problems under the title of "Normalized Compression Distance" or NCD. A lot of this work has been done by Rudi Cilibrasi.

ED######> I found what appeared to be a good definition of NCD at http://www.complearn.org/ncd.html, i.e.,

NCD(x,y) = ( C(xy) - min{C(x), C(y)} ) / max{C(x), C(y)},

where C(x) is a complexity measure, for which the Kolmogorov complexity can be used. So I think (but I could well be wrong) I know what that means.

Unfortunately I am a little fuzzy about whether NCD would take "what" information, what-with-what (binding) information, or frequency information sufficiently into account to be an optimal measure of similarity. Is that right? If so, it seems to me one might have a better measure of similarity if it actually included such additional types of information. If any of these types of information are included in NCD, please explain how.

It appears (at least to me) that a Solomonoff machine is going to have to develop similarity measures based on more than complexity alone to do any sort of sophisticated pattern matching for tasks involving learning from video. So it seems to me that similarity measures capable of using such more sophisticated similarity information could be developed by a Solomonoff machine. Is that correct, or totally off base?
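To make the NCD question concrete, here is a tiny sketch that applies the formula above with an off-the-shelf compressor (zlib) standing in, very roughly, for C(x). I believe this is the spirit of how the CompLearn work approximates it in practice, but the particulars below are my own and are not something Shane or Rudi Cilibrasi necessarily endorses:

```python
# Tiny sketch of NCD with zlib standing in (very roughly) for the complexity
# measure C(x):  NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
# These strings are really too short for zlib to estimate complexity well,
# but the relative ordering (similar pair vs. dissimilar pair) usually holds.

import zlib

def C(data: bytes) -> int:
    """Approximate complexity: length of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"the cat sat on the mat", b"the cat sat on the hat"))        # smaller
print(ncd(b"the cat sat on the mat", b"quarterly profits fell by 3%"))  # larger
```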
Thank you for whatever enlightenment you can give me and other readers on this list by answering these questions.

Ed Porter
(617) 494-1722
Fax (617) 494-1822
[EMAIL PROTECTED]