Hey. I didn't even like this thread. I'll be right back.

On 1/2/09, Abram Demski <[email protected]> wrote:
> Steve,
>
> I'm thinking that you are taking "understanding" to mean something
> like "identifying the *actual* hidden variables responsible for the
> pattern, and finding the *actual* states of those variables".
> Probabilistic models instead *invent* hidden variables that happen to
> help explain the data. Is that about right? If so, then explaining
> what I mean by "functionally equivalent" should help. Here is an
> example: suppose we are looking at data concerning a set of chemical
> experiments. Suppose the experimental conditions are not very well
> controlled, so that interesting hidden variables are present. Suppose
> two of these are temperature and air pressure, but that the two have
> the same effect on the experiment. Then unsupervised learning will
> have no way of distinguishing between the two, so it will find only
> one hidden variable representing them both. In that sense, they are
> functionally equivalent.
>
> This implies that, in the absence of further information, the best
> thing we can do to try to "understand" the data is to model it
> probabilistically.
>
> Or perhaps when you say "understanding" it is short for "understanding
> the implications of" an already-present model. In that case, perhaps
> we could separate the quality of predictions from their speed. A
> complicated-but-accurate model is useless if we can't calculate the
> information we need quickly enough. So we also want an
> "understandable" model: one that doesn't take too long to produce
> predictions. This is different from looking for the probabilistic
> model with the best prediction accuracy. On the other hand, the
> distinction is irrelevant in (practically?) all neural-network-style
> approaches today, because the model size is fixed anyway.
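[Editorial note: Abram's temperature/pressure example can be made concrete. Below is a minimal sketch; the response curve and all numbers are illustrative assumptions, not anything from the thread. It shows that when two hidden variables enter a model only through their sum, different hidden states with the same sum produce identical observations, so no learner could ever tell them apart.]

```python
# Sketch of "functionally equivalent" hidden variables (hypothetical model):
# temperature t and air pressure p influence the measurements only through
# their sum, so no amount of data can distinguish them.
import numpy as np

def predicted_measurements(t, p):
    # Illustrative response curves; only t + p ever enters the model.
    effect = t + p
    return np.array([2.0 * effect, effect ** 2, np.sin(effect)])

a = predicted_measurements(t=2.0, p=1.0)
b = predicted_measurements(t=1.0, p=2.0)  # different hidden state, same sum
print(np.allclose(a, b))                  # the two states are indistinguishable
```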
> If the output is being fed to humans rather than further along the
> network, as in the conference example, the situation is very
> different. Human-readability becomes an issue. This paper is a good
> example of an approach that aims for better human-readability rather
> than better performance:
>
> http://www.stanford.edu/~hllee/nips07-sparseDBN.pdf
>
> The altered algorithm also seems to have performance that matches
> statistical analyses of the brain more closely (which was the
> research goal), suggesting a correlation between human-readability
> and actual performance gains (since the brain wouldn't do it if it
> were a bad idea). In a probabilistic framework this is best
> represented by a prior bias for simplicity.
>
> --Abram
>
> On Fri, Jan 2, 2009 at 1:36 PM, Steve Richfield
> <[email protected]> wrote:
>> Abram,
>>
>> Oh dammitall, I'm going to have to expose the vast extent of my
>> profound ignorance to respond. Oh well...
>>
>> On 1/1/09, Abram Demski <[email protected]> wrote:
>>> Steve,
>>>
>>> Sorry for not responding for a little while. Comments follow:
>>>
>>> >> PCA attempts to isolate components that give maximum
>>> >> information... so my question to you becomes: do you think that
>>> >> the problem you're pointing towards is suboptimal models that
>>> >> don't predict the data well enough, or models that predict the
>>> >> data fine but aren't directly useful for what you expect them to
>>> >> be useful for?
>>> >
>>> > Since prediction is NOT the goal, but rather just a useful
>>> > measure, I am only interested in recognizing that which can be
>>> > recognized, and NOT in expending resources on "understanding"
>>> > semi-random noise. Further, since compression is NOT my goal, I am
>>> > not interested in combining features in ways that minimize the
>>> > number of components.
>>> > In short, there is a lot to be learned from PCA, but a "perfect"
>>> > PCA solution is likely a less-than-perfect NN solution.
>>>
>>> What I am saying is this: a good predictive model will predict
>>> whatever is desired. Unsupervised learning attempts to find such a
>>> model. But a good predictive model will probably also predict lots
>>> of stuff we aren't particularly interested in, so supervised
>>> methods have been invented to predict single variables when those
>>> variables are of interest. Still, in principle, we could use
>>> unsupervised methods. Furthermore (as I understand it), if we are
>>> dealing with many variables and believe deep patterns are present,
>>> unsupervised learning can outperform supervised learning by
>>> grabbing onto patterns that may ultimately lead to the desired
>>> result, which supervised learning would miss because no immediate
>>> value was evident. But, anyway, my point is that I can see only two
>>> meanings for the word "goodness":
>>>
>>> --usefulness in predicting the data as a whole
>>> --usefulness in predicting reward in particular (the real goal)
>>
>> I'm still hung up on "predicting", which may indeed be the best
>> measure of value, but AGI efforts need understanding, which is
>> subtly different. OK, so what is the difference?
>>
>> The tree of reality has many branches into the future - there are
>> many possible futures. "Understanding" is the process of keeping
>> track of which branch you are on, while "predicting" is taking shots
>> at which branch will prevail. One may necessarily involve the other.
>> Has anyone thought this through yet?
>>>
>>> (Actually, I can think of a third: usefulness in *getting* reward
>>> (i.e., motor control). But I feel adding that to the discussion
>>> would be premature... there are interesting issues, but they are
>>> separate from the ones being discussed here...)
>>>
>>> >> To that end... you weren't talking about using the *predictions*
>>> >> of the PCA model, but rather the principal components
>>> >> themselves. The components are essentially hidden variables that
>>> >> make the model run.
>>> >
>>> > ... or variables smushed together in ways that may work well for
>>> > compression, but poorly for recognition.
>>>
>>> What are the variables that you keep worrying might be smushed
>>> together? Can you give an example?
>>
>> I thought I could, but then I ran into problems, as you discuss
>> below.
>>>
>>> If PCA smushes variables together, that suggests one of three
>>> things:
>>>
>>> --PCA found suboptimal components
>>
>> Here, I am hung up on "found". This implies a multitude of
>> "solutions", yet there are guys out there beating on the matrix
>> manipulations to "solve" PCA. Is this like non-zero-sum game theory,
>> where there can be many solutions, some better than others?
>>>
>>> --PCA found optimal components, but the hidden variables that got
>>> smushed really are functionally equivalent (when looked at through
>>> the lens of the available visible variables)
>>
>> Here, I am hung up on "functionally". This presumes supervised
>> learning or divine observation.
>>>
>>> --The true probabilistic situation violates the probabilistic
>>> assumptions behind PCA
>>>
>>> The third option is by far the most probable, I think.
>>
>> That's where I got stuck trying to come up with an example.
>>>
>>> >> or in an attempt to complexify the model to make it more
>>> >> accurate in its predictions, by looking for links between the
>>> >> hidden variables, or patterns over time, et cetera.
>>> >
>>> > Setting predictions aside, the next layer of PCA-like neurons
>>> > would be looking for those links.
>>>
>>> Absolutely.
>>
>> More on my ignorance...
>>
>> I and PCA hadn't really "connected" until a few months ago, when I
>> attended a computer conference and listened to several presentations.
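[Editorial note: the "smushing" case Abram calls functionally equivalent can be checked numerically. A sketch with synthetic data, all parameters being illustrative assumptions: two hidden variables drive five visible measurements only through their sum, and PCA via SVD then recovers a single dominant component rather than two.]

```python
# Two hidden variables with identical effects get "smushed" into one
# principal component (synthetic data; all numbers are assumptions).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5

t = rng.normal(size=n_samples)         # hidden: temperature
p = rng.normal(size=n_samples)         # hidden: air pressure
effect = t + p                         # identical effect on every measurement

weights = rng.normal(size=n_features)  # how strongly each visible variable responds
X = np.outer(effect, weights) + 0.05 * rng.normal(size=(n_samples, n_features))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Nearly all variance falls on one component: temperature and pressure
# appear as a single, functionally equivalent hidden variable.
print(explained)
```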
>> The (possibly false, at least in some instances) impression I got
>> was that the presenters didn't really understand some/many of the
>> "components" that they were finding. One video-compression presenter
>> did identify the first few, but admittedly failed to identify later
>> components.
>>
>> I can see that this process necessarily involves a tiny amount of a
>> priori information, specifically, knowledge of:
>>
>> 1. The physical extent of features, e.g. as controlled by mutual
>> inhibition.
>> 2. The threshold for feature recognition, e.g. the number of active
>> synapses that must be involved for a feature to be interesting.
>> 3. The acceptable "fuzziness" of recognition, e.g. just how
>> accurately a feature must match its "pattern".
>> 4. ??? What have I missed in this list?
>> 5. Some or all of the above may be calculable based on ???
>>
>> Thanks for your help.
>>
>> Steve Richfield
>>
>> ________________________________
>> agi | Archives | Modify Your Subscription
>
> --
> Abram Demski
> Public address: [email protected]
> Public archive: http://groups.google.com/group/abram-demski
> Private address: [email protected]
>
> -------------------------------------------
> agi
> Archives: https://www.listbox.com/member/archive/303/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/303/
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
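[Editorial note: items 1-3 of Steve's list can be read as three knobs on a feature detector. A hypothetical sketch follows; the function name, the 0/1 synapse encoding, and the threshold values are all illustrative assumptions, not anything proposed in the thread.]

```python
# Three a-priori knobs on a toy feature detector (all values illustrative):
# the array length fixes the feature's physical extent (item 1), min_active
# is the recognition threshold (item 2), and max_fuzziness is the accepted
# mismatch when comparing against the stored pattern (item 3).
import numpy as np

def feature_matches(inputs, pattern, min_active=3, max_fuzziness=0.2):
    """inputs and pattern are 0/1 arrays over the feature's extent."""
    if int(inputs.sum()) < min_active:            # item 2: too few active synapses
        return False
    mismatch = float(np.mean(inputs != pattern))  # fraction of disagreeing synapses
    return mismatch <= max_fuzziness              # item 3: acceptable fuzziness

pattern = np.array([1, 1, 0, 1, 0, 1])
print(feature_matches(np.array([1, 1, 0, 1, 0, 1]), pattern))  # exact match: True
print(feature_matches(np.array([1, 0, 0, 1, 0, 0]), pattern))  # too few active: False
```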
