Gareth Russell writes:
> Wirt Atmar's erudite reply unfortunately ends with the wrong answer to your
> question: it is the lowest number you want, not the lowest absolute number.
Yes, that's correct. My own personal colloquialisms crept into a technical
discussion, I failed to read Bonnie's question carefully enough, and I did not
mean what I wrote. I very much appreciate Gareth's clarification. My apologies.
The discussion I was having -- primarily with myself, I suspect -- was about
making the choice between the goodness-of-fit component of the AIC and its
cost-of-complexity penalty. It is possible that you may have multiple minima
that are closely valued. At that point, you have the opportunity to choose
which attributes of your various models you deem most appropriate to your
circumstance: quality of prediction or simplicity.
What I wrote in that regard was, "These choices of course presume that you
have multiple minima in your AIC values. It is entirely possible that the
optimization surface is a simple bowl however with a single point of global
optimality. If that's so, and the true costs of additional complexity are
accurately represented in the second term, then you would always choose the
lowest absolute value." What I should have written, and what I meant, was that
"[in the case of a single extremum,] then you absolutely want to choose the
lowest value."
Gareth's comments do, however, allow me an opportunity to expand a little bit
on my previous posting. I personally hold David Anderson and Ken Burnham in
very high regard, but I worry that the AIC is being oversold to the ecological
community -- for two different reasons.
The first is that there is a high degree of arbitrariness to the formulation
of the AIC. Cost-of-complexity penalties are common in engineering equations,
but the penalty rate isn't something assigned by God or Mother Nature. It
depends more on the whims of the equation's author at the time he first wrote
the equation than on anything else.
Akaike wrote his equation in this manner:
(-2 log likelihood) + (2k)
...where the first term measures the goodness-of-fit of the model and the
second penalizes its complexity (k being the number of fitted parameters). But
he could just as easily have written the penalty term as:
(-2 log likelihood) + (2log(k))
or
(-2 log likelihood) + (2k^2)
If he had chosen the first alternative, his equation would have greatly
emphasized goodness-of-fit over any complexity penalty [the penalty for
complexity grows very slowly as more parameters are added when the log of the
number is taken]. Indeed, if he had made the complexity penalty zero, he would
simply have thrown Ockham's Razor out the window, and any maximally good fit
would have been equally acceptable.
[BTW, for a very brief but quite good practical explanation of Ockham's Razor
in the practice of physics, please see:
http://phyun5.ucr.edu/~wudka/Physics7/Notes_www/node10.html ]
But if Akaike had chosen the second alternative, simplicity would have greatly
outweighed high-fidelity prediction in the resulting specification of the best
model, because the cost-of-complexity term rises so quickly. While Akaike
chose a linear penalty form, there's nothing especially magical or sacred in
that choice.
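To make the arbitrariness concrete, here is a small sketch in Python. The two
models and their log-likelihood values are invented purely for illustration;
the point is only how the ranking of models flips as the penalty form changes:

```python
import math

def aic_variants(log_likelihood, k):
    """Akaike's AIC alongside two hypothetical alternative penalty forms.

    log_likelihood: maximized log-likelihood of the fitted model
    k: number of fitted parameters
    """
    fit = -2.0 * log_likelihood
    return {
        "linear (Akaike)": fit + 2 * k,            # penalty grows linearly in k
        "logarithmic":     fit + 2 * math.log(k),  # penalty grows very slowly
        "quadratic":       fit + 2 * k ** 2,       # penalty dominates quickly
    }

# Two invented models: a 3-parameter fit and a better-fitting 10-parameter fit.
simple_model  = aic_variants(log_likelihood=-50.0, k=3)
complex_model = aic_variants(log_likelihood=-42.0, k=10)
# Under the linear and logarithmic penalties the 10-parameter model scores
# lower (better); under the quadratic penalty the 3-parameter model wins.
```

The same pair of fits, judged by three equally defensible-looking penalties,
yields opposite verdicts about which model is "best."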
This arbitrariness is to a degree just as evident in Shannon's equation for
the average entropy of an ensemble of symbols:
H' = - sum_i (p_i log p_i)
Shannon defined his information metric, I = -log p_i, as a log term for a
variety of reasons, in part because it has pleasing additive properties, but
most especially because it was an homage to Boltzmann's equation for entropy:
S = - k log W
...the governing philosophy that Shannon wanted to mimic in his specific
definition of entropy.
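As a small illustration, Shannon's average entropy can be computed directly
from its definition (a Python sketch; the example distributions are invented):

```python
import math

def shannon_entropy(probs, base=2):
    """Average entropy H' = -sum(p_i * log p_i) of a discrete distribution.

    Terms with p_i = 0 contribute nothing, by the convention 0 log 0 = 0.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

H_fair   = shannon_entropy([0.5, 0.5])  # a fair coin: exactly 1 bit per toss
H_biased = shannon_entropy([0.9, 0.1])  # a loaded coin: ~0.47 bits per toss
```

The biased ensemble carries less average entropy because its outcomes are more
predictable -- the additive log form is what makes such comparisons behave so
pleasantly.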
While Shannon's average entropy is to some extent arbitrarily defined,
Boltzmann's is not. Boltzmann's equation is an extraordinarily insightful
redefinition of Clausius' original definition of entropy:
dS = dQ/T
...where Clausius treated entropy as a bulk property of heat rather than as an
assemblage of microscopic states, as Boltzmann did. The "k" in Boltzmann's
equation is merely a scaling factor to align the two definitions. Both
Shannon's
and Boltzmann's equations caused revolutions in thought, but they're not quite
equal in their explanatory values of the natural world. Boltzmann's equation
is much more akin to Newton's F = ma than is Shannon's. Indeed, Boltzmann's
equation was considered so important that it is engraved on his tombstone:
http://www.wellesley.edu/Chemistry/stats/boltz.jpg
However, let me not overemphasize the arbitrariness of Shannon's equation
either. Boltzmann's S and Shannon's H can be made equivalent to one another,
as you can see near the bottom of this page:
http://www.answers.com/topic/entropy
And that brings me to my second concern. All models are not equal in their
value to us. The equations of Shannon, Boltzmann, Clausius, Kepler and Einstein
represent fundamental understandings of the governing rules of the universe.
And in that regard, they represent a deep human understanding, which is of
course the primary goal of science. Indeed, Einstein's E = mc^2 was such a
triumph because it doesn't even require a scaling factor to relate such
previously disparate quantities as mass and energy. Due to earlier careful
measurements, we had already gotten the units correct.
This is a qualitatively different condition than sequentially running through
every conceivable polynomial model, attempting to choose the best solely by
means of some mechanical metric such as the AIC. If that's done, in the end
nothing has been learned, and the question becomes: was it even worth the
effort?
I would have a terrible time calling this scattershot procedure science.
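To make the scattershot procedure concrete, here is a sketch of it in Python:
fit every polynomial degree from 0 through 5 to some synthetic data and let
the AIC pick the winner. Everything here -- the data, the degree range, and
the Gaussian-error AIC form n*log(RSS/n) + 2k -- is an invented illustration,
not anyone's published method:

```python
import math
import random

def polyfit_ls(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations."""
    n, k = len(xs), degree + 1
    # Build X^T X and X^T y for the monomial (Vandermonde) design matrix.
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * k
    for r in range(k - 1, -1, -1):
        coeffs[r] = (b[r] - sum(A[r][c] * coeffs[c]
                                for c in range(r + 1, k))) / A[r][r]
    return coeffs

def aic_of_fit(xs, ys, degree):
    """AIC for a Gaussian-error polynomial model: n*log(RSS/n) + 2k."""
    coeffs = polyfit_ls(xs, ys, degree)
    rss = sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    k = degree + 2  # polynomial coefficients plus the error variance
    return len(xs) * math.log(rss / len(xs)) + 2 * k

random.seed(0)
xs = [i / 10 for i in range(40)]
ys = [1.0 + 2.0 * x - 0.5 * x * x + random.gauss(0, 0.3) for x in xs]

best = min(range(6), key=lambda d: aic_of_fit(xs, ys, d))
# The AIC sweep typically lands near the true quadratic -- but the sweep
# itself teaches us nothing about *why* a quadratic governs the system.
```

The machinery works, and it works mechanically; that is precisely the point of
the complaint.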
Nevertheless, let me also say at this point that this scattershot method has
also received a measure of high acceptance in the scientific community of
late. The most exquisite example of the simultaneous engineering utility and
scientific meaninglessness of the procedure exists in the training of neural
networks.
For example, in the figure:
http://smig.usgs.gov/SMIG/features_0902/tualatin_ann.fig3.gif
...each circle is a node, composed of something engineers call a "threshold
logic unit" and mathematicians might call a "linear discriminant function with
a threshold." It has been proven that three layers of such nodes are minimally
necessary to predict any form of complex discriminant function. It has also
been proven that no more than three layers are necessary; thus structures of
this form are now called 3-layer feedforward perceptrons. The network as a
whole is "trained" (that is, the weights are adjusted) to provide an accurate
prediction using methods such as backpropagation or Darwinian competitions
between various networks.
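As a sketch of what such a structure looks like, here is a minimal 3-layer
feedforward pass in Python. This is a generic illustration, not the actual
network in the figure; the layer sizes and the random untrained weights are
invented:

```python
import math
import random

def layer(inputs, weights, biases):
    """One layer of 'threshold logic units': weighted sums through a sigmoid."""
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(ws, inputs)) + b)))
            for ws, b in zip(weights, biases)]

def rand_layer(n_in, n_out):
    """A random weight matrix and bias vector -- untrained placeholders."""
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [random.uniform(-1, 1) for _ in range(n_out)])

random.seed(1)

# A 3-layer feedforward perceptron: 2 inputs -> 4 nodes -> 3 nodes -> 1 output.
w1, b1 = rand_layer(2, 4)
w2, b2 = rand_layer(4, 3)
w3, b3 = rand_layer(3, 1)

def network(x):
    return layer(layer(layer(x, w1, b1), w2, b2), w3, b3)

out = network([0.5, -0.3])
# "Training" (backpropagation, Darwinian competition) would adjust w1..b3; the
# individual weights, once evolved, carry no interpretable meaning in
# themselves -- only the mapping as a whole does.
```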
But what took some time was for the philosophical realization to sink in that
the evolved weights carry absolutely no meaning, at least in the sense that
we normally associate with the equations of Boltzmann, Newton and Einstein. Nor
do we learn anything by examining the internal architecture of the resulting
network in detail. This notion of meaninglessness was discomfiting to a large
number of early workers in neural nets, in great part because they came from a
physical-sciences background, but I believe that everyone now accepts it to be
so.
In that regard, no better example exists than in the very pretty work recently
done by one of my former students, David Fogel. David has constructed neural
networks to play checkers and chess at grand-champion levels by allowing
various neural networks to compete within Darwinian arenas. In David's
experiments, it took six months for the champion checkers network to be
evolved, but that's only because he ran that evolution on a very slow, single
PC. That evolution could have been sped up from six months to just an hour or
two by running the competitions on a thousand much-higher-speed PCs in
parallel.
David recently gave a talk on the procedure and it's up as a narrated slide
show at:
http://ebrains.la.asu.edu/~jennie/tutorial/audio_tutorial.htm
If you have the time, it's well worth watching. And when you watch it, think
about the AIC minimization of the complexity of David's black-box neural
networks. In the checkers program alone there are close to 2000 parameters to
adjust. Although David gave it no consideration (he used only one pre-chosen
neural architecture), there could easily be simpler architectures that would
produce comparable results, and perhaps more complex structures that would
produce even better results. It's just that it would take years of
trial-and-error experiments to determine whether that were so and how to
properly assign a complexity penalty.
As I say, although I have some substantial trouble considering the evolutions
of these various architectures and weights to be in any way science, they are
on the other hand philosophically very reminiscent of the evolutionary process
itself, in that the twin processes of pleiotropy and polygeny dominate the
outcome, making any statement of causation fruitless. It's clear from even the
most casual observation of natural flight that the aerodynamic equations of
Venturi, Bernoulli, Torricelli and von Karman are deeply embedded in the
genome of every bird, bat, or butterfly -- but where are they, specifically?
There is no place in the genome of any of these animals where the equations of
flight are explicitly written out in rule form, and yet natural selection has
built predictive models, items we call wings, for the various flight regimes
at least as well as we can build them ourselves.
And that is the essence of my second reservation about the uncritical use of
the AIC: the tendency to whip through model after model, but in the end to
understand nothing more about the system under study than you did before you
began. While in the end all models must be accurately predictive of the world
they claim to mimic, there is an extraordinary qualitative difference between
those equations that have been determined to accurately represent the subtle
physics of the universe and those that have been devised by "automated
discovery" mechanisms.
Wirt Atmar