Hey. I didn't even like this thread. I'll be right back.

On 1/2/09, Abram Demski <[email protected]> wrote:
> Steve,
>
> I'm thinking that you are taking "understanding" to mean something
> like "identifying the *actual* hidden variables responsible for the
> pattern, and finding the *actual* states of those variables".
> Probabilistic models instead *invent* hidden variables that happen to
> help explain the data. Is that about right? If so, then explaining
> what I mean by "functionally equivalent" should help. Here is an
> example: suppose we are looking at data concerning a set of chemical
> experiments. Suppose the experimental conditions are not very well
> controlled, so that interesting hidden variables are present. Suppose
> two of these are temperature and air pressure, but that the two have
> the same effect on the experiment. Then unsupervised learning will
> have no way of distinguishing between the two, so it will find only
> one hidden variable representing them both. In that sense, they are
> functionally equivalent.
>
> This implies that, in the absence of further information, the best
> thing we can do to try to "understand" the data is to model it
> probabilistically.
>
> Or perhaps when you say "understanding" it is short for "understanding
> the implications of" an already-present model. In that case, perhaps
> we could separate the quality of predictions from their speed. A
> complicated-but-accurate model is useless if we can't calculate the
> information we need quickly enough. So we also want an
> "understandable" model: one that doesn't take too long to produce
> predictions. This is different from looking for the probabilistic
> model with the best prediction accuracy. On the other hand, the
> distinction is irrelevant in (practically?) all neural-network-style
> approaches today, because the model size is fixed anyway.
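[Editorial note: Abram's temperature/pressure example can be made concrete. Below is a minimal sketch; the response curve and all numbers are illustrative assumptions, not anything from the thread. It shows that when two hidden variables enter a model only through their sum, different hidden states with the same sum produce identical observations, so no learner could ever tell them apart.]

```python
# Sketch of "functionally equivalent" hidden variables (hypothetical model):
# temperature t and air pressure p influence the measurements only through
# their sum, so no amount of data can distinguish them.
import numpy as np

def predicted_measurements(t, p):
    # Illustrative response curves; only t + p ever enters the model.
    effect = t + p
    return np.array([2.0 * effect, effect ** 2, np.sin(effect)])

a = predicted_measurements(t=2.0, p=1.0)
b = predicted_measurements(t=1.0, p=2.0)  # different hidden state, same sum
print(np.allclose(a, b))                  # the two states are indistinguishable
```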
> If the output is being fed to humans rather than further along the
> network, as in the conference example, the situation is very
> different. Human-readability becomes an issue. This paper is a good
> example of an approach that aims for better human-readability rather
> than better performance:
>
> http://www.stanford.edu/~hllee/nips07-sparseDBN.pdf
>
> The altered algorithm also seems to have performance that matches
> statistical analyses of the brain more closely (which was the
> research goal), suggesting a correlation between human-readability
> and actual performance gains (since the brain wouldn't do it if it
> were a bad idea). In a probabilistic framework this is best
> represented by a prior bias for simplicity.
>
> --Abram
>
> On Fri, Jan 2, 2009 at 1:36 PM, Steve Richfield
> <[email protected]> wrote:
>> Abram,
>>
>> Oh dammitall, I'm going to have to expose the vast extent of my
>> profound ignorance to respond. Oh well...
>>
>> On 1/1/09, Abram Demski <[email protected]> wrote:
>>> Steve,
>>>
>>> Sorry for not responding for a little while. Comments follow:
>>>
>>> >> PCA attempts to isolate components that give maximum
>>> >> information... so my question to you becomes: do you think that
>>> >> the problem you're pointing towards is suboptimal models that
>>> >> don't predict the data well enough, or models that predict the
>>> >> data fine but aren't directly useful for what you expect them to
>>> >> be useful for?
>>> >
>>> > Since prediction is NOT the goal, but rather just a useful
>>> > measure, I am only interested in recognizing that which can be
>>> > recognized, and NOT in expending resources on "understanding"
>>> > semi-random noise. Further, since compression is NOT my goal, I am
>>> > not interested in combining features in ways that minimize the
>>> > number of components.
>>> > In short, there is a lot to be learned from PCA, but a "perfect"
>>> > PCA solution is likely a less-than-perfect NN solution.
>>>
>>> What I am saying is this: a good predictive model will predict
>>> whatever is desired. Unsupervised learning attempts to find such a
>>> model. But a good predictive model will probably also predict lots
>>> of stuff we aren't particularly interested in, so supervised
>>> methods have been invented to predict single variables when those
>>> variables are of interest. Still, in principle, we could use
>>> unsupervised methods. Furthermore (as I understand it), if we are
>>> dealing with many variables and believe deep patterns are present,
>>> unsupervised learning can outperform supervised learning by
>>> grabbing onto patterns that may ultimately lead to the desired
>>> result, which supervised learning would miss because no immediate
>>> value was evident. But, anyway, my point is that I can see only two
>>> meanings for the word "goodness":
>>>
>>> --usefulness in predicting the data as a whole
>>> --usefulness in predicting reward in particular (the real goal)
>>
>> I'm still hung up on "predicting", which may indeed be the best
>> measure of value, but AGI efforts need understanding, which is
>> subtly different. OK, so what is the difference?
>>
>> The tree of reality has many branches into the future - there are
>> many possible futures. "Understanding" is the process of keeping
>> track of which branch you are on, while "predicting" is taking shots
>> at which branch will prevail. One may necessarily involve the other.
>> Has anyone thought this through yet?
>>>
>>> (Actually, I can think of a third: usefulness in *getting* reward
>>> (i.e., motor control). But I feel adding that to the discussion
>>> would be premature... there are interesting issues, but they are
>>> separate from the ones being discussed here...)
>>>
>>> >> To that end... you weren't talking about using the *predictions*
>>> >> of the PCA model, but rather the principal components
>>> >> themselves. The components are essentially hidden variables that
>>> >> make the model run.
>>> >
>>> > ... or variables smushed together in ways that may work well for
>>> > compression, but poorly for recognition.
>>>
>>> What are the variables that you keep worrying might be smushed
>>> together? Can you give an example?
>>
>> I thought I could, but then I ran into problems, as you discuss
>> below.
>>>
>>> If PCA smushes variables together, that suggests one of three
>>> things:
>>>
>>> --PCA found suboptimal components
>>
>> Here, I am hung up on "found". This implies a multitude of
>> "solutions", yet there are guys out there beating on the matrix
>> manipulations to "solve" PCA. Is this like non-zero-sum game theory,
>> where there can be many solutions, some better than others?
>>>
>>> --PCA found optimal components, but the hidden variables that got
>>> smushed really are functionally equivalent (when looked at through
>>> the lens of the available visible variables)
>>
>> Here, I am hung up on "functionally". This presumes supervised
>> learning or divine observation.
>>>
>>> --The true probabilistic situation violates the probabilistic
>>> assumptions behind PCA
>>>
>>> The third option is by far the most probable, I think.
>>
>> That's where I got stuck trying to come up with an example.
>>>
>>> >> or in an attempt to complexify the model to make it more
>>> >> accurate in its predictions, by looking for links between the
>>> >> hidden variables, or patterns over time, et cetera.
>>> >
>>> > Setting predictions aside, the next layer of PCA-like neurons
>>> > would be looking for those links.
>>>
>>> Absolutely.
>>
>> More on my ignorance...
>>
>> I and PCA hadn't really "connected" until a few months ago, when I
>> attended a computer conference and listened to several presentations.
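[Editorial note: the "smushing" case Abram calls functionally equivalent can be checked numerically. A sketch with synthetic data, all parameters being illustrative assumptions: two hidden variables drive five visible measurements only through their sum, and PCA via SVD then recovers a single dominant component rather than two.]

```python
# Two hidden variables with identical effects get "smushed" into one
# principal component (synthetic data; all numbers are assumptions).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5

t = rng.normal(size=n_samples)         # hidden: temperature
p = rng.normal(size=n_samples)         # hidden: air pressure
effect = t + p                         # identical effect on every measurement

weights = rng.normal(size=n_features)  # how strongly each visible variable responds
X = np.outer(effect, weights) + 0.05 * rng.normal(size=(n_samples, n_features))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Nearly all variance falls on one component: temperature and pressure
# appear as a single, functionally equivalent hidden variable.
print(explained)
```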
>> The (possibly false, at least in some instances) impression I got
>> was that the presenters didn't really understand some/many of the
>> "components" that they were finding. One video-compression presenter
>> did identify the first few, but admittedly failed to identify later
>> components.
>>
>> I can see that this process necessarily involves a tiny amount of a
>> priori information, specifically, knowledge of:
>>
>> 1. The physical extent of features, e.g. as controlled by mutual
>> inhibition.
>> 2. The threshold for feature recognition, e.g. the number of active
>> synapses that must be involved for a feature to be interesting.
>> 3. The acceptable "fuzziness" of recognition, e.g. just how
>> accurately a feature must match its "pattern".
>> 4. ??? What have I missed in this list?
>> 5. Some or all of the above may be calculable based on ???
>>
>> Thanks for your help.
>>
>> Steve Richfield
>>
>> ________________________________
>> agi | Archives | Modify Your Subscription
>
> --
> Abram Demski
> Public address: [email protected]
> Public archive: http://groups.google.com/group/abram-demski
> Private address: [email protected]
>
> -------------------------------------------
> agi
> Archives: https://www.listbox.com/member/archive/303/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/303/
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
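[Editorial note: items 1-3 of Steve's list can be read as three knobs on a feature detector. A hypothetical sketch follows; the function name, the 0/1 synapse encoding, and the threshold values are all illustrative assumptions, not anything proposed in the thread.]

```python
# Three a-priori knobs on a toy feature detector (all values illustrative):
# the array length fixes the feature's physical extent (item 1), min_active
# is the recognition threshold (item 2), and max_fuzziness is the accepted
# mismatch when comparing against the stored pattern (item 3).
import numpy as np

def feature_matches(inputs, pattern, min_active=3, max_fuzziness=0.2):
    """inputs and pattern are 0/1 arrays over the feature's extent."""
    if int(inputs.sum()) < min_active:            # item 2: too few active synapses
        return False
    mismatch = float(np.mean(inputs != pattern))  # fraction of disagreeing synapses
    return mismatch <= max_fuzziness              # item 3: acceptable fuzziness

pattern = np.array([1, 1, 0, 1, 0, 1])
print(feature_matches(np.array([1, 1, 0, 1, 0, 1]), pattern))  # exact match: True
print(feature_matches(np.array([1, 0, 0, 1, 0, 0]), pattern))  # too few active: False
```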
