Re: normality

Highland Statistics Ltd. Fri, 25 Aug 2006 09:39:19 -0700

At 16:27 25/08/2006, DeSolla,Shane [Burlington] wrote:

>Sorry to take this off list, but I wasn't sure 
>if I was grossly misunderstanding what you two 
>are saying in the following email exchange. My 
>apologies if I am not reading you correctly. See my comments below:
>
> > -----Original Message-----
> > From: Ecological Society of America: grants, jobs, news
> > [mailto:[EMAIL PROTECTED]] On Behalf Of Highland
> > Statistics Ltd.
> > Sent: Friday, August 25, 2006 9:30 AM
> > To: [email protected]
> > Subject: Re: PCA question
> >
> > On Thu, 24 Aug 2006 11:04:42 -0300, James J. Roper <[EMAIL PROTECTED]>
> > wrote:
> >
> > >Steve,
> > >
> >
> > Dear Jim,
>
><< SNIPPED >>
>
> > >The variance is only a good estimate of the "true" variance if the
> > >distribution is normal or transformable to normality, and
> > so, normality
> > >is required. Correlations, to be meaningful, also require
> > normality, as
> > >the statistic program is not using a covariance matrix based
> > on ranks
> > >(Spearman).
>
>What has to be normal? The raw data?


No..not the raw data...that is a misconception. 
You have to assume that if you repeated the 
sampling under the same environmental conditions, 
you would measure very similar values. 
Suppose you have the money/time/energy to do 
this...go 100 times into the field under the same 
environmental conditions, and sample (have fun). 
If you then make a scatterplot of your Y versus 
your X, you would hope that the 100 Y values at 
each X value form a bell-shaped curve showing the 
range of all possible realisations. If it is really a 
bell-shaped pattern you can assume normality. If 
the spread at each X value is also the same, you 
can assume homogeneity. Very often this is not 
the case..so you take a hammer and knock on the 
data until the spread at 
each X value is the same (and that is called a 
transformation). More elegant options are 
available (e.g. different variances per 
stratum)...see for example chapter 5 in Pinheiro and Bates for further options.
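
The "hammer" described above can be illustrated with a small simulation. 
This is only a sketch under assumed conditions: the five X values, the 
multiplicative (lognormal) noise and the choice of a log transformation 
are all hypothetical, picked so that the spread of Y grows with its mean.

```python
import math
import random
import statistics

random.seed(7)

# Hypothetical set-up: 100 repeat visits at each of five X values.
# The noise is multiplicative, so the spread of Y grows with its mean.
x_values = [1, 2, 3, 4, 5]
n_visits = 100
samples = {
    x: [math.exp(0.5 * x + random.gauss(0.0, 0.3)) for _ in range(n_visits)]
    for x in x_values
}

# Spread of Y at each X value, before and after a log transformation.
sd_raw = {x: statistics.stdev(ys) for x, ys in samples.items()}
sd_log = {x: statistics.stdev([math.log(y) for y in ys])
          for x, ys in samples.items()}

for x in x_values:
    print(f"X = {x}: sd(Y) = {sd_raw[x]:6.3f}   sd(log Y) = {sd_log[x]:5.3f}")
```

On the raw scale the spread increases along X; on the log scale it is 
roughly constant. A different noise structure would need a different 
transformation, or one of the more elegant options mentioned above.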

Very often, people don't have multiple 
observations per X value....very often only one 
(especially in field studies). So..technically 
you can't check for normality or 
homogeneity...you can only pool all the residuals 
and hope that these are normally distributed. But it is not conclusive.

Now...confusion arises over whether normality 
applies to the raw data or to the 
residuals....well...technically you can show that 
normality of the raw data (Y), given the X, 
implies normality of the residuals. So..you have 
to assume that the X are without error....or else 
it all goes wrong. There is some text in Faraway 
(2004) that shows how and why and where it goes wrong.


>Or the error? I am no statistician, but it 
>sounds like you are talking about the 
>distribution of the data. If so, why should the 
>raw data be normally distributed?

As explained above...it is the residuals. I would 
not recommend checking for normality of the raw 
data. I see students panicking with bimodal 
histograms of raw data..only to discover that the 
bimodality is caused by a sex effect...and the 
residuals were perfectly normally distributed.
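
The sex-effect situation above is easy to reproduce in a simulation. 
What follows is a rough sketch, not real data: the group means (10 and 
16), the common standard deviation and the sample sizes are assumptions 
made up for the illustration, and the model is simply one mean per sex.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: a response measured on 500 females and 500 males.
# The sexes differ in mean, but the scatter around each mean is normal.
n = 500
sex = ["F"] * n + ["M"] * n
y = ([random.gauss(10.0, 1.0) for _ in range(n)]
     + [random.gauss(16.0, 1.0) for _ in range(n)])

# Simplest model with a sex effect: one fitted mean per sex.
fitted_mean = {
    s: statistics.mean([yi for yi, si in zip(y, sex) if si == s])
    for s in ("F", "M")
}
residuals = [yi - fitted_mean[si] for yi, si in zip(y, sex)]


def bin_counts(values, bins=12):
    """Count how many values fall in each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / step), bins - 1)] += 1
    return counts


def show(title, values):
    """Print a crude text histogram, one '#' per five observations."""
    print(title)
    for c in bin_counts(values):
        print("  " + "#" * (c // 5))


show("Raw data:", y)
show("Residuals after fitting sex:", residuals)
```

The histogram of the raw data has a valley between the two sexes, while 
the residuals from the model with a sex effect pile up around zero.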




> > In some situations perhaps yes..but I can also imagine
> > situations in which this does not hold. Suppose you are
> > interested in the correlation between a species abundance and
> > temperature. Assuming bivariate normality means that each of
> > the variables should be normally distributed. So..most of
> > your temperature values should be clumped around a certain
> > value.... If all the fun happens in this specific temperature
> > regime, then that is fine. But if you have long gradients, it
> > is perhaps better to take equal number of samples along the
> > temperature gradient (this is also one of the assumptions in
> > methods like canonical correspondence analysis and redundancy
> > analysis...see Ter Braak 1986).
>
>Again, it sounds like you two are arguing that 
>the data has to be normally distributed.

No ..the residuals. Other things you should do are:
1. Plot residuals versus fitted values. Check 
whether the spread is the same everywhere. If 
not, you are in trouble (heterogeneity). 
Solution: add more covariates, improve your 
model, add interactions, allow for different 
variances using GLS or mixed modelling, etc. 
Consider a Poisson distribution, or something stronger.
2. Plot residuals versus each explanatory 
variable. You don't want to see any patterns. If 
you do see patterns.....trouble. Consider adding 
more covariates, different model or apply 
smoothing methods like GAM, among many other options.
3. Investigate the model for influential observations.
4. Check for independence.
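
The four checks above can be sketched numerically for a simple linear 
regression. This is a minimal illustration on made-up data, not a 
substitute for looking at the plots: the simulated data set, the split 
into halves and thirds, and the use of leverage, Cook's distance and 
lag-1 autocorrelation as summaries are all assumptions of this sketch.

```python
import random
import statistics

random.seed(42)

# Hypothetical data: a straight-line relationship in which the spread
# of y grows with x, so the homogeneity assumption is violated.
n = 200
x = [random.uniform(1.0, 10.0) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 0.2 * xi) for xi in x]

# Ordinary least squares for y = a + b * x.
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# 1. Residuals versus fitted values: compare the residual spread in the
#    lower and upper half of the fitted values.
by_fitted = [r for _, r in sorted(zip(fitted, resid))]
spread_ratio = (statistics.stdev(by_fitted[n // 2:])
                / statistics.stdev(by_fitted[: n // 2]))

# 2. Residuals versus the explanatory variable: the mean residual in
#    each third of the ordered x values should be close to zero.
by_x = [r for _, r in sorted(zip(x, resid))]
cuts = [0, n // 3, 2 * n // 3, n]
mean_by_third = [statistics.mean(by_x[lo:hi])
                 for lo, hi in zip(cuts, cuts[1:])]

# 3. Influential observations: leverage and Cook's distance.
p = 2  # number of estimated coefficients (a and b)
s2 = sum(r * r for r in resid) / (n - p)
leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
cooks = [r * r / (p * s2) * h / (1.0 - h) ** 2
         for r, h in zip(resid, leverage)]

# 4. Independence: lag-1 autocorrelation of the residuals in sampling
#    order (only meaningful if the data have a natural order).
lag1 = (sum(resid[i] * resid[i + 1] for i in range(n - 1))
        / sum(r * r for r in resid))

print(f"spread ratio (upper/lower half): {spread_ratio:.2f}")
print("mean residual per third of x:", [round(m, 3) for m in mean_by_third])
print(f"largest Cook's distance: {max(cooks):.3f}")
print(f"lag-1 autocorrelation: {lag1:.3f}")
```

A spread ratio well above 1 points to heterogeneity (point 1), which is 
what these simulated data were built to show. With real data the same 
quantities are better judged from the residual plots themselves.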

If any of these assumptions is violated, then you are 
in trouble. I have yet to see a publication 
using ecological data in which linear regression 
is applied correctly. Anyone who has one, please send 
me the pdf and data so that I can use it in 
academic courses. I can point to various stats 
books and Nature papers where the results show 
residual patterns, violation of independence, or unequal spread.


So much for a quick regression course.

Alain
www.highstat.com




>Here is a thought experiment (though you can try 
>this if you like). Take a uniform distribution 
>(very non-normal, right?). Take a random sample 
>of, say 12 observations. Calculate the mean. 
>Take another sample of 12, calculate the mean. 
>Repeat 1000 times. Plot the distribution of the 
>randomly generated means. I bet the distribution 
>of the means will be approximately normal, even 
>though the raw data definitely isn't. Hence, the 
>assumption of ANOVAs, for example, is that the 
>expected distribution of the means is normal 
>(or, if you like, the residuals), not the raw 
>data. Ditto for regressions. If the raw data is 
>normally distributed, that is a sufficient 
>condition for the residuals to be normal, but it 
>is not a necessary condition. Hence, if you show 
>the data is normal everything is fine, but if 
>the raw data is not normal, that alone is not 
>sufficient for the assumptions of ANOVAs 
>(correlations, regressions, etc) to be violated. 
>Do not the assumptions of multivariate normality follow similar logic?
>
>Or, as I said, am I completely misreading what you two are saying?
>
>I have snipped the rest of the discussion, as 
>you seemed to have (correctly, in my 
>not-so-educated opinion) switched to discussing 
>assumptions around the residuals, rather than 
>the raw data. But it doesn't seem to fit what you have discussed earlier.
>
>Hope you don't mind this email - I am still trying to learn
>
>Cheers,
>Shane
>
>_____________________________________________
>Shane de Solla
>Wildlife Conservation Biologist
>Canadian Wildlife Service
>Canada Centre for Inland Waters
>867 Lakeshore Road
>Box 5050
>Burlington, ON
>L7R 4A6
>Canada
>
>phone   905-336-4686
>fax        905-336-6434
>
>Opinions expressed are those of the author and 
>do not represent those of his employer.
>
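
The thought experiment quoted above (samples of 12 from a uniform 
distribution, repeated 1000 times) can be tried in a few lines. This is 
a sketch only: the uniform(0, 1) range and the "share within one 
standard deviation" summary are choices made for this illustration, 
used here in place of plotting the distribution of the means.

```python
import random
import statistics

random.seed(2006)

# 1000 samples of 12 observations each from a uniform(0, 1) distribution.
n_samples, sample_size = 1000, 12
raw = [[random.random() for _ in range(sample_size)]
       for _ in range(n_samples)]
means = [statistics.mean(sample) for sample in raw]
all_values = [v for sample in raw for v in sample]


def share_within_one_sd(values):
    """Fraction of values within one standard deviation of their mean.

    About 0.68 for a normal distribution, about 0.58 for a uniform one.
    """
    m, s = statistics.mean(values), statistics.stdev(values)
    return sum(abs(v - m) <= s for v in values) / len(values)


print(f"raw data:     share within 1 sd = {share_within_one_sd(all_values):.3f}")
print(f"sample means: share within 1 sd = {share_within_one_sd(means):.3f}")
print(f"sd of means = {statistics.stdev(means):.4f} "
      f"(theory: {(1 / 12) ** 0.5 / sample_size ** 0.5:.4f})")
```

The raw values behave like a uniform distribution, while the means 
behave approximately like a normal one, as the quoted text predicts.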



Dr. Alain F. Zuur
Highland Statistics Ltd.
6 Laverock road
UK - AB41 6FN Newburgh

Tel: 0044 1358 788177
Email: [EMAIL PROTECTED]
URL: www.highstat.com
URL: www.brodgar.com

