Aargh! I forgot that reply doesn't go to the whole list. So this is something I intended to send yesterday.
To emphasise one of the points here, which people seem to be missing: the
purpose of a ZIP/ZINB model is to allow for separate processes affecting
presence/absence and abundance given presence. Analysing the data
separately has two problems:

1. The presence/absence analysis confounds the zeroes where the species
isn't there with the zeroes where the species is there but wasn't sampled.

2. The abundance analysis over-estimates mean abundance, because it
excludes the zeroes where the species is present but not sampled.

The ZIP/ZINB models work by allowing for species to be present but not
observed. The paper which first proposed the ZIP model suggested fitting
it by estimating the number of zeroes where the species was present, then
estimating the other parameters, re-estimating the "false" zeroes from
those, and iterating between the two. I haven't checked the methods Alain
was suggesting, but I suspect they use the same approach (it's now called
the EM algorithm). A small sketch of that iteration is below.
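Here's a toy version, for a zero-inflated Poisson with no covariates and a
made-up simulated data set; in practice pscl::zeroinfl() does the fitting
for you, with covariates:

em_zip <- function(y, n_iter = 100) {
  ## crude starting values for the extra-zero proportion and Poisson mean
  pi_hat <- 0.5 * mean(y == 0)
  lambda <- mean(y[y > 0])
  for (i in seq_len(n_iter)) {
    ## E-step: for each observed zero, the probability that it comes from
    ## the extra-zero component rather than from the Poisson count process
    z <- ifelse(y == 0, pi_hat / (pi_hat + (1 - pi_hat) * exp(-lambda)), 0)
    ## M-step: re-estimate the mixing proportion and the Poisson mean,
    ## weighting each observation by its Poisson-state probability (1 - z)
    pi_hat <- mean(z)
    lambda <- sum((1 - z) * y) / sum(1 - z)
  }
  c(pi = pi_hat, lambda = lambda)
}

## simulated example: 40% extra zeroes on top of a Poisson with mean 3
set.seed(1)
y <- ifelse(runif(200) < 0.4, 0, rpois(200, 3))
em_zip(y)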
Bob

-------- Original Message --------
Subject: Re: Data set with many many zeros..... Help?
Date: Sun, 13 Jan 2008 18:38:41 +0200
From: Anon. <[EMAIL PROTECTED]>
To: Highland Statistics Ltd. <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]>

Highland Statistics Ltd. wrote:
> On Sat, 12 Jan 2008 15:38:33 -0400, Stephen Cole <[EMAIL PROTECTED]> wrote:
>
>> Hello Ecolog - I was wondering if anyone had any advice on the
>> following problem.
>>
>> I have a data set that is infested by a plague of zeros that is causing
>> me to violate all assumptions of classic parametric testing. These are
>> true zeros, in that the organisms in question did not occur in my
>> randomly sampled quadrats. They are not "missing data".
>>
>> I have a fully nested hierarchical design. My response variable is
>> density obtained from quadrat counts. My explanatory variables are as
>> follows:
>>
>> Region (3 levels - fixed)
>> Location(Region) (4 levels - random)
>> Site(Location(Region)) (4 levels - random)
>>
>> My plan was to analyze the data with a nested ANOVA and then proceed to
>> calculate variance components, to allow me to parse out the variance
>> that could be attributed to each spatial scale in my design. Since it
>> is known that violations of assumptions severely distort variance
>> components in random factors, I would really like to clean up my data
>> set to meet the assumptions, but as yet I have found no acceptable
>> remedial measure.
>>
> Stephen,
> The good news for you is that this is a common problem; it is called
> zero inflation. The solution is a zero-inflated Poisson, zero-inflated
> negative binomial, zero-altered Poisson, or zero-altered negative
> binomial GLM. These are mixture models. Just Google ZIP, ZINB, ZAP, ZANB
> (or hurdle models). There is a nice online pdf from Zeileis, Kleiber and
> Jackman showing you how to do these analyses in R. The book by Cameron
> and Trivedi gives the maths. Our next book has a 40-page chapter on this
> stuff (in R), but that won't help you now.
>
> The difference between ZI and ZA is the nature of the zeros (false zeros
> or true zeros), and the difference between Poisson and NB is whether you
> have extra overdispersion due to the counts, or only due to the zeros.
>
> Software in R for this stuff is reasonably new. Packages pscl and VGAM
> are good starting points.
>
> The bad news is that I am not sure what you have in terms of software
> for ZIPs + random effects. Both Cameron and Trivedi and Hilbe (2007)
> discuss these methods in the context of random effects. There was a
> paper in Environmetrics (end of 2007) applying ZIP with spatial/temporal
> correlation to seal data...in R. There are more, all very recent, papers
> with ZIP/ZAP + random effects. You may have to write the software code
> for doing this...I don't know.
>
> Having said that...you say that your random effects have 4 levels. I
> doubt if this is enough! Perhaps you should consider them as fixed? See
> Pinheiro and Bates.
>
> ZIP/ZAP is very interesting stuff!
>

I would mostly just add "I agree" to what Alain has written, but a couple
of comments:

1. It might be that a negative binomial alone is sufficient - it can
produce lots of zeroes on its own. It depends a bit on whether you think a
sufficient proportion of the zeroes are there because the species
genuinely isn't present, as opposed to being present but not recorded in
the sample (which is what the Poisson or negative binomial assume).

2. If you don't mind being Bayesian (and who in their right mind wouldn't
:-)), the models are fairly easy to set up in BUGS. If R will give you
what you need, use it first (it's easier!), but BUGS would be easier than
coding the stuff yourself.

3. I would agree that 4 levels probably aren't enough. Treating the
factors as fixed and calculating the variances from the point estimates is
probably no worse than treating them as random, but the estimates are
going to have large standard errors - you're estimating a variance from 4
data points (and those data points are themselves estimates!). To give
yourself some idea about this: if the data were balanced and behaving
well, and the true variances were equal, then the ratio of the two
estimated variances would follow an F(3,3) distribution, whose 80%
confidence interval is (0.19, 5.4). In other words, even if the variances
are equal, the estimate of one could still be 5 times the estimate of the
other. A quick check of that interval is below.
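To see where those numbers come from: each variance is estimated with 3
degrees of freedom, so under equal true variances the ratio of the two
estimates is F(3,3)-distributed, and in R

qf(c(0.1, 0.9), df1 = 3, df2 = 3)
## approximately 0.19 and 5.39

gives the 10% and 90% quantiles, i.e. the 80% interval quoted above.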
Bob

--
Bob O'Hara

Department of Mathematics and Statistics
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FIN-00014 University of Helsinki
Finland

Telephone: +358-9-191 51479
Mobile: +358 50 599 0540
Fax: +358-9-191 51400
WWW: http://www.RNI.Helsinki.FI/~boh/
Blog: http://deepthoughtsandsilliness.blogspot.com/
Journal of Negative Results - EEB: www.jnr-eeb.org