[R-sig-eco] Logistic regression with 2 categorical predictors

Dixon, Philip M [STAT] Fri, 24 Oct 2014 07:31:13 -0700

Andrew,

Your problems arise from two issues: 
1) The overall test of equal P[prefer] in each age and the tests of equal 
P[prefer] in a specific pair of ages  are based on different testing 
philosophies (Score or LR test for the overall test and Wald test for the 
pairwise comparisons).
2) The behaviour of the logit function when one empirical probability is 0 (or 
1).
You are absolutely correct when you highlight the ages with all avoid.


If you want  just my suggestion for how to conduct informative pairwise 
comparisons: Do 15 two-by-two Chi-square tests, or 15 runs of glm with only two 
ages in each analysis.  Because there is no pooling of information from ages 
uninvolved in the null hypothesis, these 2x2 tests are exactly the same as the 
pairwise hypothesis for the full 6-age data.  I don't know whether there is a R 
packge to automate this. I've always written a pair of for loops and used 
subset= with the appropriate logical in the glm().

Details (which are a very very small glimpse of a huge area):  Fleiss, analysis 
of counts and proportions, and the tome: Agresti, categorical data analysis are 
extended treatments).  It doesn't help that the analysis of this type of data 
has some non-trivial complications.  Upton, in his classic JRSS discussion 
paper on the 2x2 contingency table, said something like the 2x2 contingency 
table was a seemingly simple problem that had a surprisingly large number of 
complications.

1) Why is a 2x2 contingency table comparison of ages 1 and 2 the same as the 
pairwise comparison of ages 1 and 2 in an analysis of all 6 ages.  When the 
data are normally distributed and analyzed by ANOVA, the error variance is 
(usually) estimated from all 6 groups.  The analysis of just two groups 
estimates the error variance from fewer observations, hence the 6-group pooled 
error variance provides improved power for the comparison of groups 1 and 2.  
The null hypothesis in both cases specifies equality of ages 1 and 2, but all 6 
groups provide information about the error variance.  With count data, there is 
no error variance that has to be estimated.  (I'm not considering 
overdispersion, which if present would be estimated from all 6 groups, so the 
2x2 and "6x2 followed by a pairwise comparison" analyses are no longer 
equivalent).  Again, the null hypothesis is the same in both analyses.  One way 
to consider the specific comparison of ages 1 and 2 in the 6-age analysis is !
 that you are fitting a model with 5 probabilities to the data: one probability 
that is shared by ages 1 and 2 (hence they have equal probabilities) and four 
more probabilities, one each for ages 3 through 6.   The fit of that 5 
parameter model is compared to the fit of a 6 parameter model in which each age 
has its own probability.  Both the 2 x 2 and the pairwise comparison are 1 df 
tests.  In neither case, can the comparison of ages 1 and 2 benefit from any 
information from the other four groups.

2) What's going on the anova(model_sg, test="Chisq") analysis?  This is a 
comparison of two models.  The data and the models can be set up in at least 
three asymptotically equivalent ways (see Agresti and probably Fleiss for 
"sampling models for the contingency table" for the details).  I'll use the 
independent binomial model.  That model has up to 6 parameters, i.e. a 
P[prefer] for each age, where P[prefer] is the probability that a fish in a 
specified age group has a "Prefer" response, not an "Avoid" response.  The null 
hypothesis (reduced) model is that each age has the same probability, so there 
is one parameter.  The alternative (full) model is that each age has a 
different probability.  Yes, the alternative model is fully specified.  Yes, 
the alternative model has 0 "error" df.  When you fit these models using glm, 
you are using deviance = -2 log likelihood as the measure of fit.  Yes, the 
deviance is 0 for the full model (under the usual way of writing the log 
likelih!
 ood).  That is exactly how things should be.  The comparison between these two 
models is a measure of whether the reduced model fits the data.  (semi-Aside: 
You can write the log-likelihood for the full model so the deviance is not 0.  
If so, the log-likelihoods for both the null and alternative models are shifted 
by the same additive constant so the change in deviance is the same value for 
either way to write the log-likelihood).  This comparison of models is exactly 
the same comparison done by the usual (K-1) df Chi-square test of equal 
proportions in a Kx2 contingency table.  The only difference is that the 
Chi-square test uses the Pearson Chi-square statistic as the measure of how 
well  a model fits the data; glm uses deviance.  We regularly calculate the 
Chi-square statistic for the null hypothesis.  We never explicitly calculate 
Chi-square for the alternative hypothesis, because Chi-square for the full 
model is exactly 0 with exactly 0 df.  The change between the two !
 models is the usual Chi-square test statistic.

3) Why are counts of 0 a problem for an analysis using logit probabilities?  If 
you fit a glm(cbind(Prefer,Avoid) ~ -1+Age, ...) you would fit a 6-parameter 
model with one parameter for each group, without having to worry about R's full 
rank contrast parameterization.  The problems with 0's would be immediately 
apparent.  The parameters for ages 5 and 6 would be -25 or so.  That is a 
numerical approximation, the estimates should be -infinity.  That's because 
logit(0) = log [0/(1-0) ] = -infinity.  Any value from ca -10 to -1000000 will 
give a probability that is essentially 0.  The se's of those estimates are 
huge, which is no longer surprising, because any value in a huge range of large 
negative numbers fits equally well (because all transform back to a probability 
essentially 0; I don't think there is much practical difference between 
P[prefer]= 1E-30 or 1E-10 or even 0.00001 when the number of individuals is 
between 5 and 30).

If the estimate for age 5 is poor, the comparison of that estimate to that for 
age 1 will be poor.  Hence the large se's for any comparison to age 5 or age 6. 
 The problem is the testing philosophy, not the biological value of the test.  
The pairwise comparisons between specific groups use the estimates and their 
standard errors under the alternative hypothesis (Wald test).  The overall (5 
df) test uses the change in deviance between models (glm, likelihood ratio, LR 
test) or the fit of the data under the null hypothesis (Chi-square test, a 
score test).

If you really have to fit a model with probabilities very close to or equal to 
0, there are ways to get better-behaved estimates, e.g. using Firth's penalized 
likelihood or a Bayesian approach.  I don't believe that is necessary here.

As I said earlier, my practical solution would be to use a series of 2x2 
contingency table analyses.  You won't be able to compare groups 5 and 6, but 
you shouldn't want to do any statistical test there.  If you want a 
multiple-comparisons adjustment, multiply all the usual p-values by 15, which 
is a Bonferroni correction for all pairs of 6 groups.  That approach answers 
your biological and avoids comparing age-specific estimates.

My apologies for the long post to the group.  It couldn't be short because the 
issues are multiple, not trivial, and needed to be raised.  I would be happy to 
have a short further discussion with you, Andrew, if you want, but I suggest 
that be off the list. 

Philip Dixon

_______________________________________________
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Logistic regression with 2 categorical predictors

Reply via email to