From: Fred L. Bookstein

Seattle, August 6, 2017

Dear MorphMetters,

`This note calls your attention to two new papers of mine that attempt to`

`rebuild the advanced part of GMM -- the part of the toolbox that`

`transforms linear multivariate statistical analysis of shape coordinates`

`from mere arithmetic into real understanding of an evolutionary or`

`developmental process.`

`One of the papers argues the importance for our work of a deep theorem`

`from 1967 that isn't in our textbooks yet. Suppose your data set`

`consists of a long list of shape coordinates over a sample whose count`

`of specimens is a small multiple of that shape coordinate count, say,`

`fewer than ten times the count of landmarks. Usually, in circumstances`

`like these, most of the dimensions of the shape coordinate space can be`

`modeled the way Dryden and Mardia indicated decades ago: as uncorrelated`

`Gaussians with the same small variance. And when _that_ is the case,`

`which is most of the time in the current GMM literature, the`

`multivariate statistical tests that we invoke most often will typically`

`lead to invalid evolutionary or developmental inferences and`

`interpretations. One common type of study targeted by this warning is`

`the design that involves inverting a covariance matrix (in the course of`

`tasks like multiple regression, relative eigenanalysis, or MANOVA)`

`without examining all its principal components and their eigenvalues,`

`not just the largest few. Another large suspect group comprises the`

`studies that accept such a matrix as truth for some sort of`

`maximum-likelihood analysis or permutation analysis rather than formally`

`modeling it by a specific combination of biologically sensible factors`

`together with noise. My paper goes on to demonstrate alternative`

`approaches to biological understanding, including one that applies to`

`partial least squares analyses, and closes with six "imperatives" (terse`

`advisory slogans).`

`The other paper of this pair shows why principal components analysis of`

`shape coordinate data, an approach that has often been regarded as a`

`useful classification tool since the earliest days of GMM, should not`

`ever be trusted to arrive at a valid understanding of the organismal`

`process that interests you, however excellent your study design might`

`otherwise be. When people claim the opposite, they are expressing a`

`heartfelt wish for which there is no actual biological justification:`

`that maximizing the variance of a linear combination of Procrustes shape`

`coordinates validly conveys a meaning for the organism regardless of how`

`those landmarks and the attached shape-coordinate displacement vectors`

`might be situated over its idealized image. I go on to introduce and`

`demonstrate a novel alternative, varimax factor analysis of`

`bending-energy-adjusted partial warp scores, that I think our community`

`should explore as a candidate for the missing praxis.`

`Together the papers disqualify just about every statistic, shape`

`regression excepted, that we ever taught you to compute once you had`

`produced your Procrustes shape coordinates. Both of the papers reassert`

`and extend my argument of recent years that Procrustes distance per se`

`is not a biologically meaningful quantity. To get to a valid biological`

`explanation from Procrustes shape coordinates, you need a pattern`

`language more powerful than what textbook multivariate statistics offers`

`us -- a pattern language that pays attention to the shape coordinate`

`averages (that is, to the mean landmark configuration) along with their`

`covariance matrix. An excerpt from the first of these papers might serve`

`as a good summary of both: "Linear multivariate analysis of shape`

`coordinate data is difficult not only computationally but also`

`conceptually. You should not permit your computer to make it seem easy`

`by oversimplifying either your questions or your answers."`

`Both papers are available via the preprint servers of the corresponding`

`journal websites. The first of the two, "A newly noticed formula`

`enforces fundamental limits on geometric morphometric analysis," has`

`DOI 10.1007/s11692-017-9424-9 or can be reached via the website for`

`Evolutionary Biology (note: this is a different journal from Journal of`

`Evolutionary Biology, so google it carefully). The second, "A method of`

`factor analysis for shape coordinates," has DOI 10.1002/ajpa.23277 and`

`also is accessible via the website of the American Journal of Physical`

`Anthropology.`

Here are their abstracts. For the first one:

`The textbook literature of principal components analysis (PCA) dates`

`from a period when statistical computing was much less powerful than it`

`is today and the dimensionality of data sets typically processed by PCA`

`correspondingly much lower. When the formulas in those textbooks involve`

`limiting properties of PCA descriptors, the limit involved is usually`

`the indefinite increase of sample size for a fixed roster of variables.`

`But contemporary applications of PCA in organismal systems biology,`

`particularly in geometric morphometrics (GMM), generally involve much`

`greater counts of variables. The way one might expect pure noise to`

`degrade the biometric signal in this more contemporary context is`

`described by a different mathematical literature concerned with the`

`situation where the count of variables itself increases while remaining`

`proportional to the count of specimens. The founders of this literature`

`established a result of startling simplicity.`

`Consider steadily larger and larger data sets consisting of completely`

`uncorrelated standardized Gaussians (mean zero, variance 1) such that`

`the ratio of variables to cases (the so-called p/n ratio) is fixed at a`

`value y. Then the largest eigenvalue of their covariance matrix tends`

`to (1+\sqrt{y})^2, the smallest tends to (1-\sqrt{y})^2, and their ratio`

`tends to the limiting value ((1+\sqrt{y})/(1-\sqrt{y}))^2, whereas in`

`the uncorrelated model both of these eigenvalues and also their ratio`

`should be just 1.0. For y=1/4, not an atypical value for GMM data sets,`

`this ratio is 9; for y=1/2, which is still not atypical, it is about 34.`

`These extrema and ratios, easily confirmed in simulations of realistic`

`size and consistent with real GMM findings in typical applied settings,`

`bear severe negative implications for any technique that involves`

`inverting a covariance structure on shape coordinates, including`

`multiple regression on shape, discriminant analysis by shape, canonical`

`variates analysis of shape, covariance distance analysis from shape, and`

`maximum-likelihood estimation of shape distributions that are not`

`constrained by strong prior models. The theorem also suggests that we`

`should use extreme caution whenever considering a biological`

`interpretation of any Partial Least Squares analysis involving large`

`numbers of landmarks or semilandmarks. I illuminate these concerns with`

`the aid of one simulation, two explicit reanalyses of previously`

`published data, and several little sermons.`
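
The limiting values quoted in this abstract can be checked with a few lines of code. What follows is only a minimal sketch in Python with NumPy, not code from the paper: the function names `mp_limits` and `simulated_extremes`, the sample sizes, and the random seed are all assumptions chosen for illustration. It evaluates the formulas (1+\sqrt{y})^2 and (1-\sqrt{y})^2 and then compares them with the extreme eigenvalues of the sample covariance matrix of pure uncorrelated Gaussian noise.

```python
import numpy as np


def mp_limits(y):
    """Limiting largest and smallest eigenvalue, and their ratio,
    for uncorrelated unit-variance Gaussians with p/n ratio y (0 < y < 1)."""
    root = np.sqrt(y)
    largest = (1.0 + root) ** 2
    smallest = (1.0 - root) ** 2
    return largest, smallest, largest / smallest


def simulated_extremes(n, p, seed=0):
    """Extreme eigenvalues of the sample covariance matrix of an
    n-by-p table of independent standard Gaussians (pure noise)."""
    rng = np.random.default_rng(seed)      # seed is an arbitrary choice
    x = rng.standard_normal((n, p))        # n specimens, p variables
    cov = np.cov(x, rowvar=False)          # p-by-p sample covariance
    eig = np.linalg.eigvalsh(cov)          # eigenvalues in ascending order
    return eig[-1], eig[0], eig[-1] / eig[0]


if __name__ == "__main__":
    n = 800                                # hypothetical sample size
    for y in (0.25, 0.5):
        p = int(y * n)
        lim = mp_limits(y)
        sim = simulated_extremes(n, p)
        print(f"y={y}: limiting ratio {lim[2]:.2f}, simulated ratio {sim[2]:.2f}")
```

Under the uncorrelated model every population eigenvalue is 1, so any spread in the simulated eigenvalues is produced by sampling alone. At finite n the simulated ratio will differ somewhat from the limit, and it changes with the seed.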

For the second one:

`Currently the most common reporting style for a geometric morphometric`

`(GMM) analysis of anthropological data begins with the principal`

`components of the shape coordinates to which the original landmark data`

`have been converted. But this focus often frustrates the organismal`

`biologist, mainly because principal component analysis (PCA) is not`

`aimed at scientific interpretability of the loading patterns actually`

`uncovered. The difficulty of making biological sense of a PCA is`

`heightened by aspects of the shape coordinate setting that further`

`diverge from our intuitive expectations of how morphometric measurements`

`ought to combine. More than fifty years ago one of our sister`

`disciplines, psychometrics, managed to build an algorithmic route from`

`principal component analysis to scientific understanding via the toolkit`

`generally known as factor analysis. This article introduces a`

`modification of one standard factor-analysis approach, Henry Kaiser's`

`varimax rotation of 1958, that accommodates two of the major differences`

`between the GMM context and the psychometric context for these`

`approaches: the coexistence of "general" and "special" factors of form`

`as adumbrated by Sewall Wright, and the typical loglinearity of partial`

`warp variance as a function of bending energy. I briefly explain the`

`history of principal components in biometrics and the contrast with`

`factor analysis, introduce the modified varimax algorithm I am`

`recommending, and work three examples that are reanalyses of previously`

`published cranial data sets. A closing discussion emphasizes the`

`desirability of superseding PCA by algorithms aimed at anthropological`

`understanding rather than classification or ordination.`

-- MORPHMET may be accessed via its webpage at http://www.morphometrics.org
