[Your data and output listings removed. For comments, see at end]

On 24-May-09 13:01:26, cdm wrote:
> Fellow R Users:
> I'm not extremely familiar with lda or R programming, but a recent
> editorial review of a manuscript submission has prompted a crash
> course. I am on this forum hoping I could solicit some much needed
> advice for deriving a classification equation.
>
> I have used three basic measurements in lda to predict two groups:
> male and female. I have a working model, low Wilk's lambda, graphs,
> coefficients, eigenvalues, etc. (see below). I adjusted the sample
> analysis for Fisher's or Anderson's Iris data provided in the MASS
> library for my own data.
>
> My final and last step is simply form the classification equation.
> The classification equation is simply using standardized coefficients
> to classify each group- in this case male or female. A more thorough
> explanation is provided:
>
> "For cases with an equal sample size for each group the classification
> function coefficient (Cj) is expressed by the following equation:
>
> Cj = cj0+ cj1x1+ cj2x2+...+ cjpxp
>
> where Cj is the score for the jth group, j = 1 ... k, cjo is the
> constant for the jth group, and x = raw scores of each predictor.
> If W = within-group variance-covariance matrix, and M = column matrix
> of means for group j, then the constant cjo= (-1/2)CjMj" (Julia
> Barfield, John Poulsen, and Aaron French
> http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discriminant.htm).
>
> I am unable to navigate this last step based on the R output I have.
> I only have the linear discriminant coefficients for each predictor
> that would be needed to complete this equation.
>
> Please, if anybody is familiar or able to to help please let me know.
> There is a spot in the acknowledgments for you.
>
> All the best,
> Chase Mendenhall

The first thing I did was to plot your data. This indicates in the
first place that a perfect discrimination can be obtained on the
basis of your variables WRMA_WT and WRMA_ID alone (names
abbreviated to WG, WT, ID, SEX):

  D0 <- read.csv("horsesLDA.csv")
  # names(D0)
  # "WRMA_WG"  "WRMA_WT"  "WRMA_ID"  "WRMA_SEX"
  WG<-D0$WRMA_WG; WT<-D0$WRMA_WT;
  ID<-D0$WRMA_ID; SEX<-D0$WRMA_SEX

  ix.M<-(SEX=="M"); ix.F<-(SEX=="F")

  ## Plot WT vs ID (M & F)
  plot(ID,WT,xlim=c(0,12),ylim=c(8,15))
  points(ID[ix.M],WT[ix.M],pch="+",col="blue")
  points(ID[ix.F],WT[ix.F],pch="+",col="red")
  lines(ID,15.5-1.0*(ID))

and that there is a lot of possible variation in the discriminating
line WT = 15.5-1.0*(ID).

Also, it is apparent that the covariance between WT and ID for
Females is different from the covariance between WT and ID for
Males. Hence the assumption (of a common covariance matrix in the
two groups) for standard LDA (which you have been applying) does
not hold.

Given that the sexes can be perfectly discriminated within the data
on the basis of the linear discriminator (WT + ID) (and others),
the variable WG is in effect a close approximation to noise.

However, to the extent that there was a common covariance matrix
to the two groups (in all three variables WG, WT, ID), and this
was well estimated from the data, then inclusion of the third
variable WG could yield a slightly improved discriminator in that
the probability of misclassification (a rare event for such data)
could be minimised. But it would not make much difference!
However, since that assumption does not hold, this analysis would
not be valid.
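
The point about unequal covariances can also be checked numerically
rather than by eye. The following is only a minimal sketch: the real
horsesLDA.csv is not reproduced here, so the data frame "sim" and the
relationships built into it are made-up stand-ins, chosen so that WT
rises with ID in one sex and falls in the other. With the real data
one would split D0 by WRMA_SEX in the same way.

```r
# Hypothetical stand-in data (NOT the poster's measurements)
set.seed(1)
n <- 50
sim <- data.frame(
  SEX = rep(c("F", "M"), each = n),
  ID  = c(rnorm(n, 4, 1), rnorm(n, 8, 1))
)
# Made-up relationship: WT increases with ID for F, decreases for M
sim$WT <- ifelse(sim$SEX == "F",
                 9 + 0.5 * sim$ID,
                 16 - 0.5 * sim$ID) + rnorm(2 * n, 0, 0.3)

# One covariance matrix of (ID, WT) per sex
covs <- lapply(split(sim[, c("ID", "WT")], sim$SEX), cov)
print(covs)

# Off-diagonal terms: very different values (here, opposite signs)
# argue against pooling the groups into one common covariance matrix
cov.F <- covs[["F"]]["ID", "WT"]
cov.M <- covs[["M"]]["ID", "WT"]
print(c(F = cov.F, M = cov.M))
```

If the two matrices look alike, pooling them (as LDA does) is
reasonable; if they differ as clearly as in this artificial case,
it is not.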

If you plot WT vs WG, a common covariance is more plausible;
but there is considerable overlap for these two variables:

  plot(WG,WT)
  points(WG[ix.M],WT[ix.M],pch="+",col="blue")
  points(WG[ix.F],WT[ix.F],pch="+",col="red")

If you plot WG vs ID, there is perhaps not much overlap, but a
considerable difference in covariance between the two groups:

  plot(ID,WG)
  points(ID[ix.M],WG[ix.M],pch="+",col="blue")
  points(ID[ix.F],WG[ix.F],pch="+",col="red")

This looks better on a log scale, however:

  lWG <- log(WG) ; lWT <- log(WT) ; lID <- log(ID)
  ## Plot log(WG) vs log(ID) (M & F)
  plot(lID,lWG)
  points(lID[ix.M],lWG[ix.M],pch="+",col="blue")
  points(lID[ix.F],lWG[ix.F],pch="+",col="red")

and common covariance still looks good for WG vs WT:

  ## Plot log(WT) vs log(WG) (M & F)
  plot(lWG,lWT)
  points(lWG[ix.M],lWT[ix.M],pch="+",col="blue")
  points(lWG[ix.F],lWT[ix.F],pch="+",col="red")

but there is no improvement for WT vs ID:

  ## Plot log(WT) vs log(ID) (M & F)
  plot(lID,lWT)
  points(lID[ix.M],lWT[ix.M],pch="+",col="blue")
  points(lID[ix.F],lWT[ix.F],pch="+",col="red")

So there is no simple road to applying a routine LDA to your data.

To take account of different covariances between the two groups,
you would normally be looking at a quadratic discriminator. However,
as indicated above, the fact that a linear discriminator using the
variables ID & WT alone works so well would leave considerable
imprecision in conclusions to be drawn from its results.

Sorry this is not the straightforward answer you were hoping for
(which I confess I have not sought); it is simply a reaction to
what your data say.

Ted.
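
For readers who want to see what a quadratic discriminator looks like
in R, here is a minimal sketch using qda() from the MASS package, with
lda() alongside for comparison. It is an illustration only and not an
analysis of the data discussed above: "sim" is a made-up data frame
with unequal group covariances, and the choice of ID and WT as
predictors is an assumption for the example.

```r
library(MASS)  # provides lda() and qda()

# Hypothetical stand-in data (NOT the poster's measurements)
set.seed(2)
n <- 50
sim <- data.frame(
  SEX = factor(rep(c("F", "M"), each = n)),
  ID  = c(rnorm(n, 4, 1), rnorm(n, 8, 1))
)
sim$WT <- ifelse(sim$SEX == "F",
                 9 + 0.5 * sim$ID,
                 16 - 0.5 * sim$ID) + rnorm(2 * n, 0, 0.3)

# Linear (pooled covariance) and quadratic (one covariance per group)
fit.l <- lda(SEX ~ ID + WT, data = sim)
fit.q <- qda(SEX ~ ID + WT, data = sim)

# Resubstitution predictions on the same data (optimistic by nature)
pred.l <- predict(fit.l, newdata = sim)$class
pred.q <- predict(fit.q, newdata = sim)$class

print(table(observed = sim$SEX, lda = pred.l))
print(table(observed = sim$SEX, qda = pred.q))
```

When the groups separate almost perfectly, as in the plots above, both
tables will look nearly identical, which is exactly why the data say
little about which discriminator is the right one.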

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 24-May-09                                       Time: 20:07:43
------------------------------ XFMail ------------------------------