[MORPHMET] New Bookstein GMM papers: PCA and Factor Analysis...

From: Fred L. Bookstein            Seattle, August 6, 2017

Dear MorphMetters,


This note calls your attention to two new papers of mine that attempt to rebuild the advanced part of GMM -- the part of the toolbox that transforms linear multivariate statistical analysis of shape coordinates from mere arithmetic into real understanding of an evolutionary or developmental process.


One of the papers argues the importance for our work of a deep theorem from 1967 that isn't in our textbooks yet. Suppose your data set consists of a long list of shape coordinates over a sample whose count of specimens is a small multiple of that shape coordinate count, say, fewer than ten times the count of landmarks. Usually, in circumstances like these, most of the dimensions of the shape coordinate space can be modeled the way Dryden and Mardia indicated decades ago: as uncorrelated Gaussians with the same small variance. And when _that_ is the case, which is most of the time in the current GMM literature, the multivariate statistical tests that we invoke most often will typically lead to invalid evolutionary or developmental inferences and interpretations. One common type of study cautioned by this warning is the design that involves inverting a covariance matrix (in the course of tasks like multiple regression, relative eigenanalysis, or MANOVA) without examining all its principal components and their eigenvalues, not just the largest few. Another large suspect group comprises the studies that accept such a matrix as truth for some sort of maximum-likelihood analysis or permutation analysis rather than formally modeling it by a specific combination of biologically sensible factors together with noise. My paper goes on to demonstrate alternative approaches to biological understanding, including one that applies to partial least squares analyses, and closes with six "imperatives" (terse advisory slogans).


The other paper of this pair shows why principal components analysis of shape coordinate data, an approach that has often been regarded as a useful classification tool since the earliest days of GMM, should not ever be trusted to arrive at a valid understanding of the organismal process that interests you, however excellent your study design might otherwise be. When people claim the opposite, they are expressing a heartfelt wish for which there is no actual biological justification: that maximizing the variance of a linear combination of Procrustes shape coordinates validly conveys a meaning for the organism regardless of how those landmarks and the attached shape-coordinate displacement vectors might be situated over its idealized image. I go on to introduce and demonstrate a novel alternative, varimax factor analysis of bending-energy-adjusted partial warp scores, that I think our community should explore as a candidate for the missing praxis.


Together the papers disqualify just about every statistic, except for shape regression, that we ever taught you to compute once you had produced your Procrustes shape coordinates. Both of the papers reassert and extend my argument of recent years that Procrustes distance per se is not a biologically meaningful quantity. To get to a valid biological explanation from Procrustes shape coordinates, you need a pattern language more powerful than what textbook multivariate statistics offers us -- a pattern language that pays attention to the shape coordinate averages (that is, to the mean landmark configuration) along with their covariance matrix. An excerpt from the first of these papers might serve as a good summary of both: "Linear multivariate analysis of shape coordinate data is difficult not only computationally but also conceptually. You should not permit your computer to make it seem easy by oversimplifying either your questions or your answers."


Both papers are available via the preprint servers of the corresponding journal websites. The first of the two, "A newly noticed formula enforces fundamental limits on geometric morphometric analysis," has DOI 10.1007/s11692-017-9424-9 or can be reached via the website for Evolutionary Biology (note: this is a different journal from the Journal of Evolutionary Biology, so google it carefully). The second, "A method of factor analysis for shape coordinates," has DOI 10.1002/ajpa.23277 and is also accessible via the website of the American Journal of Physical Anthropology.

Here are their abstracts.

For the first one:

The textbook literature of principal components analysis (PCA) dates from a period when statistical computing was much less powerful than it is today and the dimensionality of data sets typically processed by PCA correspondingly much lower. When the formulas in those textbooks involve limiting properties of PCA descriptors, the limit involved is usually the indefinite increase of sample size for a fixed roster of variables. But contemporary applications of PCA in organismal systems biology, particularly in geometric morphometrics (GMM), generally involve much greater counts of variables. The way one might expect pure noise to degrade the biometric signal in this more contemporary context is described by a different mathematical literature concerned with the situation where the count of variables itself increases while remaining proportional to the count of specimens. The founders of this literature established a result of startling simplicity.


Consider steadily larger and larger data sets consisting of completely uncorrelated standardized Gaussians (mean zero, variance 1) such that the ratio of variables to cases (the so-called p/n ratio) is fixed at a value y. Then the largest eigenvalue of their covariance matrix tends to (1+\sqrt{y})^2, the smallest tends to (1-\sqrt{y})^2, and their ratio tends to the limiting value ((1+\sqrt{y})/(1-\sqrt{y}))^2, whereas in the uncorrelated model both of these eigenvalues and also their ratio should be just 1.0. For y=1/4, not an atypical value for GMM data sets, this ratio is 9; for y=1/2, which is still not atypical, it is 34. These extrema and ratios, easily confirmed in simulations of realistic size and consistent with real GMM findings in typical applied settings, bear severe negative implications for any technique that involves inverting a covariance structure on shape coordinates, including multiple regression on shape, discriminant analysis by shape, canonical variates analysis of shape, covariance distance analysis from shape, and maximum-likelihood estimation of shape distributions that are not constrained by strong prior models. The theorem also suggests that we should use extreme caution whenever considering a biological interpretation of any Partial Least Squares analysis involving large numbers of landmarks or semilandmarks. I illuminate these concerns with the aid of one simulation, two explicit reanalyses of previously published data, and several little sermons.
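
The limiting values quoted in this abstract can be checked numerically. What follows is a minimal sketch in Python with NumPy, not the simulation from the paper: the function names, the sample size (n = 2000 specimens, p = 500 variables, so y = 1/4), and the random seed are all assumptions chosen for illustration. It evaluates the limit formulas and compares them with the extreme eigenvalues of the sample covariance matrix of pure uncorrelated Gaussian noise.

```python
import numpy as np

def limiting_extremes(y):
    """Limits from the abstract for a fixed p/n ratio y (0 < y < 1):
    largest eigenvalue, smallest eigenvalue, and their ratio."""
    r = np.sqrt(y)
    return (1 + r) ** 2, (1 - r) ** 2, ((1 + r) / (1 - r)) ** 2

def simulated_extremes(n, p, seed=0):
    """Extreme eigenvalues of the sample covariance matrix of n cases of
    p independent standard Gaussians (illustrative sizes and seed)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))        # rows are cases, columns are variables
    cov = np.cov(X, rowvar=False)          # p x p sample covariance
    ev = np.linalg.eigvalsh(cov)           # eigenvalues in ascending order
    return ev[-1], ev[0], ev[-1] / ev[0]

# Limit formulas at the two p/n ratios mentioned in the abstract.
for y in (0.25, 0.5):
    print(y, limiting_extremes(y))

# One simulated data set at y = 500/2000 = 1/4, to compare with the limits.
print(simulated_extremes(n=2000, p=500))
```

In the uncorrelated model every population eigenvalue is 1, so any spread between the simulated extremes is produced by sampling alone; that spread is what degrades procedures that invert the covariance matrix.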

For the second one:

Currently the most common reporting style for a geometric morphometric (GMM) analysis of anthropological data begins with the principal components of the shape coordinates to which the original landmark data have been converted. But this focus often frustrates the organismal biologist, mainly because principal component analysis (PCA) is not aimed at scientific interpretability of the loading patterns actually uncovered. The difficulty of making biological sense of a PCA is heightened by aspects of the shape coordinate setting that further diverge from our intuitive expectations of how morphometric measurements ought to combine. More than fifty years ago one of our sister disciplines, psychometrics, managed to build an algorithmic route from principal component analysis to scientific understanding via the toolkit generally known as factor analysis. This article introduces a modification of one standard factor-analysis approach, Henry Kaiser's varimax rotation of 1958, that accommodates two of the major differences between the GMM context and the psychometric context for these approaches: the coexistence of "general" and "special" factors of form as adumbrated by Sewall Wright, and the typical loglinearity of partial warp variance as a function of bending energy. I briefly explain the history of principal components in biometrics and the contrast with factor analysis, introduce the modified varimax algorithm I am recommending, and work three examples that are reanalyses of previously published cranial data sets. A closing discussion emphasizes the desirability of superseding PCA by algorithms aimed at anthropological understanding rather than classification or ordination.
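
For readers who have not met varimax before, here is a minimal sketch of the standard rotation in Python with NumPy. It is plain "raw" varimax by pairwise planar rotations with Kaiser's closed-form angle; it is NOT the modified algorithm of the paper, and it includes neither the bending-energy adjustment nor any special treatment of general versus special factors. The function names and the toy loading matrix are hypothetical, made up for illustration.

```python
import numpy as np

def varimax_criterion(L):
    """Raw varimax criterion: variance of the squared loadings,
    summed over factors (columns)."""
    return float(np.var(L ** 2, axis=0).sum())

def varimax(loadings, max_sweeps=50, tol=1e-10):
    """Standard raw varimax via pairwise planar rotations.
    Returns (rotated loadings, orthogonal rotation matrix T),
    with rotated = loadings @ T."""
    L = np.array(loadings, dtype=float)
    p, k = L.shape
    T = np.eye(k)
    for _ in range(max_sweeps):
        largest = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                x, y = L[:, i], L[:, j]
                u, v = x ** 2 - y ** 2, 2 * x * y
                A, B = u.sum(), v.sum()
                C, D = (u ** 2 - v ** 2).sum(), 2 * (u * v).sum()
                # Kaiser's angle: tan(4*phi) = num / den; arctan2 picks the maximum.
                phi = 0.25 * np.arctan2(D - 2 * A * B / p,
                                        C - (A ** 2 - B ** 2) / p)
                c, s = np.cos(phi), np.sin(phi)
                G = np.array([[c, -s], [s, c]])     # planar rotation for this pair
                L[:, [i, j]] = L[:, [i, j]] @ G
                T[:, [i, j]] = T[:, [i, j]] @ G
                largest = max(largest, abs(phi))
        if largest < tol:                           # no pair moved: converged
            break
    return L, T

# Toy example (hypothetical): a two-factor simple structure hidden by a
# 30-degree rotation of the factor axes.
simple = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
theta = np.deg2rad(30.0)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
mixed = simple @ Q
rotated, T = varimax(mixed)
print(np.round(rotated, 3))
print(varimax_criterion(mixed), varimax_criterion(rotated))
```

The rotation leaves each variable's communality (its row sum of squared loadings) unchanged and only redistributes it across factors so that each factor has a few large loadings and many near zero, which is the sense in which factor analysis aims at interpretability rather than at maximal variance.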

--
MORPHMET may be accessed via its webpage at http://www.morphometrics.org
