Aargh! I forgot that reply doesn't go to the whole list. So this is something I intended to send yesterday.
To emphasise one of the points here, which people seem to be missing: the
purpose of a ZIP/ZINB model is to allow for separate processes affecting
presence/absence and abundance given presence. Analysing the data
separately has two problems:

1. The presence/absence analysis confounds the zeroes where the species
isn't there with the zeroes where the species is there but wasn't sampled.

2. The abundance analysis over-estimates mean abundance, because it
excludes the zeroes where the species is present but not sampled.

The ZIP/ZINB models work by allowing for species to be present but not
observed. The paper which first proposed the ZIP model suggested fitting
it by estimating the number of zeroes where the species was present, then
estimating the other parameters, re-estimating the "false" zeroes from
those, and iterating between the two. I haven't checked the methods Alain
was suggesting, but I suspect they use the same approach (it's now called
the EM algorithm). A small sketch of that iteration is below.
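Here's a toy version, for a zero-inflated Poisson with no covariates and a
made-up simulated data set; in practice pscl::zeroinfl() does the fitting
for you, with covariates:

em_zip <- function(y, n_iter = 100) {
  ## crude starting values for the extra-zero proportion and Poisson mean
  pi_hat <- 0.5 * mean(y == 0)
  lambda <- mean(y[y > 0])
  for (i in seq_len(n_iter)) {
    ## E-step: for each observed zero, the probability that it comes from
    ## the extra-zero component rather than from the Poisson count process
    z <- ifelse(y == 0, pi_hat / (pi_hat + (1 - pi_hat) * exp(-lambda)), 0)
    ## M-step: re-estimate the mixing proportion and the Poisson mean,
    ## weighting each observation by its Poisson-state probability (1 - z)
    pi_hat <- mean(z)
    lambda <- sum((1 - z) * y) / sum(1 - z)
  }
  c(pi = pi_hat, lambda = lambda)
}

## simulated example: 40% extra zeroes on top of a Poisson with mean 3
set.seed(1)
y <- ifelse(runif(200) < 0.4, 0, rpois(200, 3))
em_zip(y)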
Bob

-------- Original Message --------
Subject: Re: Data set with many many zeros..... Help?
Date: Sun, 13 Jan 2008 18:38:41 +0200
From: Anon. <[EMAIL PROTECTED]>
To: Highland Statistics Ltd. <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]>

Highland Statistics Ltd. wrote:
> On Sat, 12 Jan 2008 15:38:33 -0400, Stephen Cole <[EMAIL PROTECTED]> wrote:
>
>> Hello Ecolog - I was wondering if anyone had any advice on the
>> following problem.
>>
>> I have a data set that is infested by a plague of zeros that is causing
>> me to violate all assumptions of classic parametric testing. These are
>> true zeros, in that the organisms in question did not occur in my
>> randomly sampled quadrats. They are not "missing data".
>>
>> I have a fully nested hierarchical design. My response variable is
>> density obtained from quadrat counts. My explanatory variables are as
>> follows:
>>
>> Region (3 levels - fixed)
>> Location(Region) (4 levels - random)
>> Site(Location(Region)) (4 levels - random)
>>
>> My plan was to analyze the data with a nested ANOVA and then proceed to
>> calculate variance components, to allow me to parse out the variance
>> that could be attributed to each spatial scale in my design. Since it
>> is known that violations of assumptions severely distort variance
>> components in random factors, I would really like to clean up my data
>> set to meet the assumptions, but as yet I have found no acceptable
>> remedial measure.
>>
> Stephen,
> The good news for you is that this is a common problem; it is called
> zero inflation. The solution is a zero-inflated Poisson, zero-inflated
> negative binomial, zero-altered Poisson, or zero-altered negative
> binomial GLM. These are mixture models. Just Google ZIP, ZINB, ZAP, ZANB
> (or hurdle models). There is a nice online pdf from Zeileis, Kleiber and
> Jackman showing you how to do these analyses in R. The book by Cameron
> and Trivedi gives the maths. Our next book has a 40-page chapter on this
> stuff (in R), but that won't help you now.
>
> The difference between ZI and ZA is the nature of the zeros (false zeros
> or true zeros), and the difference between Poisson and NB is whether you
> have extra overdispersion due to the counts, or only due to the zeros.
>
> Software in R for this stuff is reasonably new. Packages pscl and VGAM
> are good starting points.
>
> The bad news is that I am not sure what you have in terms of software
> for ZIPs + random effects. Both Cameron and Trivedi and Hilbe (2007)
> discuss these methods in the context of random effects. There was a
> paper in Environmetrics (end of 2007) applying ZIP with spatial/temporal
> correlation to seal data...in R. There are more, all very recent, papers
> with ZIP/ZAP + random effects. You may have to write the software code
> for doing this...I don't know.
>
> Having said that...you say that your random effects have 4 levels. I
> doubt if this is enough! Perhaps you should consider them as fixed? See
> Pinheiro and Bates.
>
> ZIP/ZAP is very interesting stuff!
>

I would mostly just add "I agree" to what Alain has written, but a couple
of comments:

1. It might be that a negative binomial alone is sufficient - it can
produce lots of zeroes on its own. It depends a bit on whether you think a
sufficient proportion of the zeroes are there because the species
genuinely isn't present, as opposed to being present but not recorded in
the sample (which is what the Poisson or negative binomial assume).

2. If you don't mind being Bayesian (and who in their right mind wouldn't
:-)), the models are fairly easy to set up in BUGS. If R will give you
what you need, use it first (it's easier!), but BUGS would be easier than
coding the stuff yourself.

3. I would agree that 4 levels probably aren't enough. Treating the
factors as fixed and calculating the variances from the point estimates is
probably no worse than treating them as random, but the estimates are
going to have large standard errors - you're estimating a variance from 4
data points (and those data points are themselves estimates!). To give
yourself some idea about this: if the data were balanced and behaving
well, and the true variances were equal, then the ratio of the two
estimated variances would follow an F(3,3) distribution, whose 80%
confidence interval is (0.19, 5.4). In other words, even if the variances
are equal, the estimate of one could still be 5 times the estimate of the
other. A quick check of that interval is below.
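To see where those numbers come from: each variance is estimated with 3
degrees of freedom, so under equal true variances the ratio of the two
estimates is F(3,3)-distributed, and in R

qf(c(0.1, 0.9), df1 = 3, df2 = 3)
## approximately 0.19 and 5.39

gives the 10% and 90% quantiles, i.e. the 80% interval quoted above.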
Bob

--
Bob O'Hara

Department of Mathematics and Statistics
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FIN-00014 University of Helsinki
Finland

Telephone: +358-9-191 51479
Mobile: +358 50 599 0540
Fax: +358-9-191 51400
WWW: http://www.RNI.Helsinki.FI/~boh/
Blog: http://deepthoughtsandsilliness.blogspot.com/
Journal of Negative Results - EEB: www.jnr-eeb.org