Gareth Russell writes:
> Wirt Atmar's erudite reply unfortunately ends with the wrong answer to your
> question: it is the lowest number you want, not the lowest absolute number.
Yes, that's correct. My own personal colloquialisms crept into a technical
discussion, I failed to read Bonnie's question carefully enough, and I did not
mean what I wrote. I very much appreciate Gareth's clarification. My apologies.
The discussion I was having -- primarily with myself, I suspect -- was about
making the choice between the goodness-of-fit component of the AIC and its
cost-of-complexity penalty. It is possible that you may have multiple minima
that are closely valued. At that point, you have the opportunity to choose
which attributes of your various models you deem most appropriate to your
circumstance: quality of prediction or simplicity.
What I wrote in that regard was, "These choices of course presume that you
have multiple minima in your AIC values. It is entirely possible that the
optimization surface is a simple bowl however with a single point of global
optimality. If that's so, and the true costs of additional complexity are
accurately represented in the second term, then you would always choose the
lowest absolute value." What I should have written, and what I meant, was that
"[in the case of a single extremum,] then you absolutely want to choose the
lowest value."
Gareth's comments do, however, allow me an opportunity to expand a little bit
on my previous posting. I personally hold David Anderson and Ken Burnham in
very high regard, but I worry that the AIC is being oversold to the ecological
community -- for two different reasons.
The first is that there is a high degree of arbitrariness to the formulation
of the AIC. Cost-of-complexity penalties are common in engineering equations,
but the penalty rate isn't something assigned by God or Mother Nature. It
depends more on the whims of the equation's author at the time he first wrote
the equation than on anything else.
Akaike wrote his equation in this manner:
(-2 log likelihood) + (2k)
...where the first term measures the goodness-of-fit of the model and the
second penalizes its complexity (k being the number of fitted parameters). But
he could just as easily have written the penalty term as:
(-2 log likelihood) + (2log(k))
or
(-2 log likelihood) + (2k^2)
If he had chosen the first alternative, his equation would have greatly
emphasized goodness-of-fit over any complexity penalty [the penalty for
complexity grows very slowly as more parameters are added when the log of the
number is taken]. Indeed, if he had made the complexity penalty zero, he would
simply have thrown Ockham's Razor out the window, and any maximally good fit
would have been equally acceptable.
[BTW, for a very brief but quite good practical explanation of Ockham's Razor
in the practice of physics, please see:
http://phyun5.ucr.edu/~wudka/Physics7/Notes_www/node10.html ]
But if Akaike had chosen the second alternative, simplicity would have greatly
outweighed high-fidelity prediction in the resulting specification of the best
model, because the cost-of-complexity term rises so quickly. While Akaike
chose a linear penalty form, there's nothing especially magical or sacred in
that choice.
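To make the arbitrariness concrete, here is a small sketch in Python. The two
models and their log-likelihood values are invented purely for illustration;
the point is only how the ranking of models flips as the penalty form changes:

```python
import math

def aic_variants(log_likelihood, k):
    """Akaike's AIC alongside two hypothetical alternative penalty forms.

    log_likelihood: maximized log-likelihood of the fitted model
    k: number of fitted parameters
    """
    fit = -2.0 * log_likelihood
    return {
        "linear (Akaike)": fit + 2 * k,            # penalty grows linearly in k
        "logarithmic":     fit + 2 * math.log(k),  # penalty grows very slowly
        "quadratic":       fit + 2 * k ** 2,       # penalty dominates quickly
    }

# Two invented models: a 3-parameter fit and a better-fitting 10-parameter fit.
simple_model  = aic_variants(log_likelihood=-50.0, k=3)
complex_model = aic_variants(log_likelihood=-42.0, k=10)
# Under the linear and logarithmic penalties the 10-parameter model scores
# lower (better); under the quadratic penalty the 3-parameter model wins.
```

The same pair of fits, judged by three equally defensible-looking penalties,
yields opposite verdicts about which model is "best."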
This arbitrariness is to a degree just as evident in Shannon's equation for
the average entropy of an ensemble of symbols:
H' = - sum_i (p_i log p_i)
Shannon defined his information metric, I = -log p_i, as a log term for a
variety of reasons, in part because it has pleasing additive properties, but
most especially because it was an homage to Boltzmann's equation for entropy:
S = - k log W
...the governing philosophy that Shannon wanted to mimic in his specific
definition of entropy.
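As a small illustration, Shannon's average entropy can be computed directly
from its definition (a Python sketch; the example distributions are invented):

```python
import math

def shannon_entropy(probs, base=2):
    """Average entropy H' = -sum(p_i * log p_i) of a discrete distribution.

    Terms with p_i = 0 contribute nothing, by the convention 0 log 0 = 0.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

H_fair   = shannon_entropy([0.5, 0.5])  # a fair coin: exactly 1 bit per toss
H_biased = shannon_entropy([0.9, 0.1])  # a loaded coin: ~0.47 bits per toss
```

The biased ensemble carries less average entropy because its outcomes are more
predictable -- the additive log form is what makes such comparisons behave so
pleasantly.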
While Shannon's average entropy is to some extent arbitrarily defined,
Boltzmann's is not. Boltzmann's equation is an extraordinarily insightful
redefinition of Clausius' original definition of entropy:
dS = dQ/T
...where Clausius treated entropy as a bulk property of heat rather than as an
assemblage of microscopic states, as Boltzmann did. The "k" in Boltzmann's
equation is merely a scaling factor to align the two definitions. Both
Shannon's
and Boltzmann's equations caused revolutions in thought, but they're not quite
equal in their explanatory values of the natural world. Boltzmann's equation
is much more akin to Newton's F = ma than is Shannon's. Indeed, Boltzmann's
equation was considered so important that it is engraved on his tombstone:
http://www.wellesley.edu/Chemistry/stats/boltz.jpg
However, let me not overemphasize the arbitrariness of Shannon's equation
either. Boltzmann's S and Shannon's H can be made equivalent to one another,
as you can see near the bottom of this page:
http://www.answers.com/topic/entropy
And that brings me to my second concern. All models are not equal in their
value to us. The equations of Shannon, Boltzmann, Clausius, Kepler and Einstein
represent fundamental understandings of the governing rules of the universe.
And in that regard, they represent a deep human understanding, which is of
course the primary goal of science. Indeed, Einstein's E = mc^2 was such a
triumph because it doesn't even require a scaling factor to relate such
previously disparate quantities as mass and energy. Due to earlier careful
measurements, we had already gotten the units correct.
This is a qualitatively different condition than sequentially running through
every conceivable polynomial model, attempting to choose the best solely by
means of some mechanical metric such as the AIC. If that's done, in the end
nothing has been learned, and the question becomes: was it even worth the
effort?
I would have a terrible time calling this scattershot procedure science.
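To make the scattershot procedure concrete, here is a sketch of it in Python:
fit every polynomial degree from 0 through 5 to some synthetic data and let
the AIC pick the winner. Everything here -- the data, the degree range, and
the Gaussian-error AIC form n*log(RSS/n) + 2k -- is an invented illustration,
not anyone's published method:

```python
import math
import random

def polyfit_ls(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations."""
    n, k = len(xs), degree + 1
    # Build X^T X and X^T y for the monomial (Vandermonde) design matrix.
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * k
    for r in range(k - 1, -1, -1):
        coeffs[r] = (b[r] - sum(A[r][c] * coeffs[c]
                                for c in range(r + 1, k))) / A[r][r]
    return coeffs

def aic_of_fit(xs, ys, degree):
    """AIC for a Gaussian-error polynomial model: n*log(RSS/n) + 2k."""
    coeffs = polyfit_ls(xs, ys, degree)
    rss = sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    k = degree + 2  # polynomial coefficients plus the error variance
    return len(xs) * math.log(rss / len(xs)) + 2 * k

random.seed(0)
xs = [i / 10 for i in range(40)]
ys = [1.0 + 2.0 * x - 0.5 * x * x + random.gauss(0, 0.3) for x in xs]

best = min(range(6), key=lambda d: aic_of_fit(xs, ys, d))
# The AIC sweep typically lands near the true quadratic -- but the sweep
# itself teaches us nothing about *why* a quadratic governs the system.
```

The machinery works, and it works mechanically; that is precisely the point of
the complaint.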
Nevertheless, let me also say at this point that this scattershot method has
also received a measure of high acceptance in the scientific community of
late. The most exquisite example of the simultaneous engineering utility and
scientific meaninglessness of the procedure exists in the training of neural
networks.
For example, in the figure:
http://smig.usgs.gov/SMIG/features_0902/tualatin_ann.fig3.gif
...each circle is a node, composed of something engineers call a "threshold
logic unit" and mathematicians might call a "linear discriminant function with
a threshold." It has been proven that three layers of such nodes are minimally
necessary to predict any form of complex discriminant function. It has also
been proven that no more than three layers are necessary; thus structures of
this form are now called 3-layer feedforward perceptrons. The network as a
whole is "trained" (that is, the weights are adjusted) to provide an accurate
prediction using methods such as backpropagation or Darwinian competitions
between various networks.
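As a sketch of what such a structure looks like, here is a minimal 3-layer
feedforward pass in Python. This is a generic illustration, not the actual
network in the figure; the layer sizes and the random untrained weights are
invented:

```python
import math
import random

def layer(inputs, weights, biases):
    """One layer of 'threshold logic units': weighted sums through a sigmoid."""
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(ws, inputs)) + b)))
            for ws, b in zip(weights, biases)]

def rand_layer(n_in, n_out):
    """A random weight matrix and bias vector -- untrained placeholders."""
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [random.uniform(-1, 1) for _ in range(n_out)])

random.seed(1)

# A 3-layer feedforward perceptron: 2 inputs -> 4 nodes -> 3 nodes -> 1 output.
w1, b1 = rand_layer(2, 4)
w2, b2 = rand_layer(4, 3)
w3, b3 = rand_layer(3, 1)

def network(x):
    return layer(layer(layer(x, w1, b1), w2, b2), w3, b3)

out = network([0.5, -0.3])
# "Training" (backpropagation, Darwinian competition) would adjust w1..b3; the
# individual weights, once evolved, carry no interpretable meaning in
# themselves -- only the mapping as a whole does.
```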
But what took some time was for the philosophical realization to sink in that
the evolved weights carry absolutely no meaning, at least in the sense that
we normally associate with the equations of Boltzmann, Newton and Einstein. Nor
do we learn anything by examining the internal architecture of the resulting
network in detail. This notion of meaninglessness was discomfiting to a large
number of early workers in neural nets, in great part because they came from a
physical-sciences background, but I believe that everyone now accepts it to be
so.
In that regard, no better example exists than in the very pretty work recently
done by one of my former students, David Fogel. David has constructed neural
networks to play checkers and chess at grand-champion levels by allowing
various neural networks to compete within Darwinian arenas. In David's
experiments, it took six months for the champion checkers network to be
evolved, but that's only because he ran that evolution on a very slow, single
PC. That evolution could have been sped up from six months to just an hour or
two by running the competitions on a thousand much-higher-speed PCs in
parallel.
David recently gave a talk on the procedure and it's up as a narrated slide
show at:
http://ebrains.la.asu.edu/~jennie/tutorial/audio_tutorial.htm
If you have the time, it's well worth watching. And when you watch it, think
about the AIC minimization of the complexity of David's black-box neural
networks. In the checkers program alone there are close to 2000 parameters to
adjust. Although David gave it no consideration (he used only one pre-chosen
neural architecture), there could easily be simpler architectures that would
produce comparable results, and perhaps more complex structures that would
produce even better results. It's just that it would take years of
trial-and-error experiments to determine whether that were so and how to
properly assign a complexity penalty.
As I say, although I have some substantial trouble considering the evolutions
of these various architectures and weights to be in any way science, they are
on the other hand philosophically very reminiscent of the evolutionary process
itself, in that the twin processes of pleiotropy and polygeny dominate the
outcome, making any statement of causation fruitless. It's clear from even the
most casual observation of natural flight that the aerodynamic equations of
Venturi, Bernoulli, Torricelli and von Karman are deeply embedded in the
genome of every bird, bat, or butterfly -- but where are they, specifically?
There is no place in the genome of any of these animals where the equations of
flight are explicitly written out in rule form, and yet natural selection has
built predictive models, items we call wings, for the various flight regimes
at least as well as we can build them ourselves.
And that is the essence of my second reservation about the uncritical use of
the AIC: the tendency to whip through model after model, but in the end to
understand nothing more about the system under study than you did before you
began. While in the end all models must be accurately predictive of the world
they claim to mimic, there is an extraordinary qualitative difference between
those equations that have been determined to accurately represent the subtle
physics of the universe and those that have been devised by "automated
discovery" mechanisms.
Wirt Atmar