Re: [R] Zero inflated: is there a limit to the level of inflation

Achim Zeileis Tue, 26 Jun 2012 14:48:15 -0700

On Tue, 26 Jun 2012, Marc Schwartz wrote:

On Jun 26, 2012, at 2:10 PM, SSimek wrote:

Hello,

I have count data that record the presence or absence of individuals in
my study population. I created a grid of cells across the study area and
calculated a count value for each individual per season per year for each
grid cell. The count value is the number of times an individual was present
in each grid cell. For illustration, my data columns look something like
this and are repeated for each individual:

Cell_ID Param1      Param2  Param3   Param4   COUNT  Name  Year  Season  Cov
1       160.565994  729.08  1503     7930.3   0      AA    2010  AUT     Open
1       160.565994  729.08  1503     7930.3   22     AA    2011  SPR     Open
1       160.565994  729.08  1503     7930.3   12     AA    2009  SUM     Open
1       160.565994  729.08  1503     7930.3   0      AA    2010  SUM     Open
2       169.427001  491.87  1503.31  5101.09  0      AA    2010  AUT     oldHard
2       169.427001  491.87  1503.31  5101.09  16     AA    2011  SPR     oldHard
2       169.427001  491.87  1503.31  5101.09  0      AA    2009  SUM     oldHard
2       169.427001  491.87  1503.31  5101.09  0      AA    2010  SUM     oldHard
...
563     86.777099   612.69  977      4474.6   62     AA    2010  AUT     Water
563     86.777099   612.69  977      4474.6   12     AA    2011  SPR     Water
563     86.777099   612.69  977      4474.6   55     AA    2009  SUM     Water

1       160.565994  729.08  1503     7930.3   0      BB    2010  SUM     Open
2       169.427001  491.87  1503.31  5101.09  72     BB    2010  SUM     oldHard
5       160.75      614.95  1503.31  2878.98  16     BB    2010  SUM     medHard
6       170.404998  510.58  1489.44  743.14   0      BB    2010  SUM     Water
...
563     86.777099   612.69  977      4474.6   0      BB    2010  SUM     Water

1       160.565994  729.08  1503     7930.3   14     C     2005  AUT     Open
1       160.565994  729.08  1503     7930.3   0      C     2006  AUT     Open
1       160.565994  729.08  1503     7930.3   0      C     2006  SPR     Open
1       160.565994  729.08  1503     7930.3   56     C     2007  SPR     Open
1       160.565994  729.08  1503     7930.3   0      C     2006  SUM     Open
2       169.427001  491.87  1503.31  5101.09  124    C     2005  AUT     oldHard
2       169.427001  491.87  1503.31  5101.09  231    C     2006  AUT     oldHard
2       169.427001  491.87  1503.31  5101.09  889    C     2006  SPR     oldHard
2       169.427001  491.87  1503.31  5101.09  0      C     2007  SPR     oldHard
...
563     86.777099   612.69  977      4474.6   0      C     2005  AUT     Water
563     86.777099   612.69  977      4474.6   231    C     2006  AUT     Water
563     86.777099   612.69  977      4474.6   185    C     2006  SPR     Water
563     86.777099   612.69  977      4474.6   123    C     2007  SPR     Water
563     86.777099   612.69  977      4474.6   52     C     2006  SUM     Water



I have 563 grid cells across my study area, and each individual has 1-563
cells associated with it for each year and season in which it was monitored.
Therefore my grid cells are repeated. I end up with 71,000 records, of which
925 have a Count value > 0 and the remaining 70,075 have a Count value
= 0.

I wanted to run a zero-inflated Poisson mixed model to estimate the effects
of the parameters, with individual as the random effect. But I have been
advised two things:

1. I cannot run a zero-inflated Poisson model because my data are too
"extremely" inflated (i.e. 70,075 vs 925) and

2. I cannot run the model with each cell repeated for each individual. I am
told the model doesn't recognize that Cell_ID #1 for individual "A" is the
same Cell_ID #1 for individual "B".

Does anyone know if either or both of these points are true? I would
appreciate any thoughts, advice, or suggestions.

Thanks!

-Stephanie



Hi Stephanie,

Some comments:

1. You should think about or at least be open to a zero-inflated negative 
binomial distribution rather than a zero-inflated Poisson.

2. You should at least review the vignette for the pscl CRAN package, which 
provides standard fixed effects models and related functions for count based 
data and importantly, some good conceptual content:

 http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf

3. Given the repeated measures framework and correlation issues you likely 
have, you should subscribe to and re-post your query to the R-sig-mixed-models 
list:

 https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

which will avail you of experts in the field.

4. There is also a draft FAQ for mixed models here:

 http://glmm.wikidot.com/faq

which I believe is maintained by Ben Bolker, who actively participates in the 
above list. Based upon the content there, I suspect that you will be pointed to 
the glmmADMB package which is on R-Forge 
(http://glmmadmb.r-forge.r-project.org/) and can handle zero inflated mixed 
effects models of at least some types.

5. If all else fails, just to plant a seed, you might want to consider a
mixed effects logistic regression model with a binary response, since
you appear to have a relatively small "event" incidence in your data.
The above list will also be helpful in that setting and you would likely
be pointed to the glmer() function in the lme4 package for that
application, which provides for GLMs in a mixed effects framework.


Thanks, Marc, all very useful points! Just one addition:

I would recommend starting with the last point - a binary response
regression (for y > 0). This could be considered as the zero-hurdle of a
hurdle regression.
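As a rough illustration of that zero-hurdle step, here is a minimal sketch in base R. The data are simulated and the column names (param1, cov, count, any_use) are hypothetical stand-ins, not the poster's actual variables; it also ignores the random effect for individual.

```r
# Simulated, hypothetical data: none of these values come from the thread.
set.seed(1)
n <- 2000
dat <- data.frame(
  param1 = rnorm(n),
  cov    = factor(sample(c("Open", "oldHard", "Water"), n, replace = TRUE))
)

# Rare presences: the probability of any use follows a logit model in param1.
present <- rbinom(n, size = 1, prob = plogis(-3 + 0.8 * dat$param1))

# Strictly positive counts where present (1 + Poisson), zero elsewhere.
dat$count <- ifelse(present == 1, 1 + rpois(n, lambda = 5), 0)

# Binary response for the zero hurdle: 1 if count > 0, 0 otherwise.
dat$any_use <- as.integer(dat$count > 0)

# Logistic regression for count > 0 versus count == 0.
zero_fit <- glm(any_use ~ param1 + cov, data = dat, family = binomial)
summary(zero_fit)
```

To bring in the random effect for individual that Marc mentions, the same kind of formula with an added (1 | Name) term could be passed to glmer() from lme4 with family = binomial, assuming that package is installed.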

Hurdle regressions are an alternative to zero-inflated models, but have
the nice property that you can separately estimate both parts of the
hurdle: (1) a binary regression for y=0 vs. y > 0. (2) A truncated count
model for y, estimated only from the observations y>0. The "pscl" package
contains a hurdle() function which estimates both parts in one go (and the
"countreg" vignette gives more details and references), but in this case
it would probably be useful to estimate them separately.

In any case, both parts will need care because the binary response
probably contains a lot of (quasi-)complete separations because non-zeros
are so rare. Conversely, the truncated count model may be hard to estimate
because there are no observations for a lot of parameter combinations. But
estimating the models separately will give you more flexibility in
addressing these issues.
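One simple way to look for such separation before fitting is to cross-tabulate each factor against the binary response and flag levels that have an empty cell. The sketch below uses simulated data with hypothetical column names (cov, any_use) and deliberately builds in a level with no non-zeros; it only covers single factors, not combinations of covariates.

```r
# Simulated, hypothetical data: "Open" cells are constructed to have no presences.
set.seed(3)
n <- 300
dat <- data.frame(
  cov = factor(sample(c("Open", "oldHard", "Water"), n, replace = TRUE))
)
dat$any_use <- ifelse(dat$cov == "Open", 0L, rbinom(n, size = 1, prob = 0.1))

# Rows are habitat levels, columns are the binary response (0 / 1).
tab <- table(dat$cov, dat$any_use)
tab

# A level with an empty cell has only zeros (or only ones), so its
# coefficient in a logistic regression is not finitely estimable.
separated <- rownames(tab)[apply(tab == 0, 1, any)]
separated
```

Levels flagged this way could be merged with neighbouring categories or dropped from the binary part, which is one kind of flexibility that fitting the two parts separately allows.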

To estimate the zero-truncated count distributions, you may consider the
"countreg" package from R-Forge which uses the same code as (one part of)
the hurdle() function.
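To make the count part concrete, here is a hand-rolled sketch of a zero-truncated Poisson regression fitted by maximum likelihood with optim() in base R. This is not the countreg or pscl code; the data are simulated, the names (x, y, negll) are hypothetical, and a negative binomial would need a different likelihood. The truncated density is P(Y = y | Y > 0) = dpois(y, lambda) / (1 - exp(-lambda)).

```r
# Simulated, hypothetical positive counts with a log-linear mean.
set.seed(2)
n <- 500
x <- rnorm(n)
lambda <- exp(1 + 0.5 * x)

# Draw from a zero-truncated Poisson by rejection: redraw any zeros.
y <- rpois(n, lambda)
while (any(y == 0)) {
  z <- y == 0
  y[z] <- rpois(sum(z), lambda[z])
}

# Negative log-likelihood of the zero-truncated Poisson with log link.
X <- cbind(1, x)
negll <- function(beta) {
  lam <- exp(drop(X %*% beta))
  # log(1 - exp(-lam)) is the log-probability of a non-zero count.
  -sum(dpois(y, lam, log = TRUE) - log1p(-exp(-lam)))
}

fit <- optim(c(0, 0), negll, method = "BFGS")
fit$par  # estimates of the intercept and slope on the log scale
```

In real data this would be run only on the rows with count > 0, which is what makes the two parts of the hurdle separable.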


hth,
Z

Regards,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


