Re: [R] Separation issue in binary response models - glm, brglm, logistf

Xochitl CORMON Thu, 28 Feb 2013 08:59:30 -0800


On 28/02/2013 17:22, Ben Bolker wrote:

Thank you for your help !

Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:

Dear all,

I am encountering some issues with my data and need some help. I am
trying to run a glm analysis with a presence/absence variable as the
response variable and several explanatory variables (time,
location, presence/absence data, abundance data).

First I tried to use the glm() function; however, I got two
warnings from glm.fit():
# 1: glm.fit: algorithm did not converge
# 2: glm.fit: fitted probabilities numerically 0 or 1 occurred
After some investigation I found out that the problem was most
probably quasi-complete separation, and I therefore decided to use
brglm and/or logistf.
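For readers unfamiliar with these two warnings, here is a minimal sketch of how separation triggers them. The toy vectors `x` and `y` are invented for illustration and have nothing to do with CPUE_table: every `y == 1` occurs at a larger `x` than every `y == 0`, so the likelihood keeps increasing as the slope grows and no finite estimate exists.

```r
# Toy data (hypothetical): x separates the two classes perfectly.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Collect the warnings raised during the fit instead of printing them.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)       # warnings emitted by glm.fit
print(coef(fit))  # very large coefficients are a typical symptom
```

With quasi-complete separation the pattern is the same, except that a few observations share a predictor value across both classes.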


* logistf: the analysis does not run. When running logistf() I get an
error message saying:
# error in chol.default(x) :
# leading minor 39 is not positive definite
I looked into the logistf package manual, on the Internet, and in the
theoretical and technical papers of Heinze and Ploner, and I cannot
find where this function is used or whether the error can be fixed
by some settings.


chol.default is a function for Cholesky decomposition, which is going
to be embedded fairly deeply in the code ...
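As a small illustration of what this error means in general (this is not the logistf internals, and the matrices below are made up), chol() fails whenever the symmetric matrix it is given is not positive definite. In a model fit this can happen when the matrix being decomposed is singular or badly conditioned.

```r
# A symmetric matrix that is NOT positive definite (eigenvalues 3 and -1).
A <- matrix(c(1, 2,
              2, 1), nrow = 2, byrow = TRUE)

# chol() signals an error here; capture it rather than stopping.
res <- tryCatch(chol(A), error = function(e) e)
print(inherits(res, "error"))
print(conditionMessage(res))

# For comparison, a positive-definite matrix decomposes without trouble,
# and the factor R satisfies t(R) %*% R == B.
B <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
R <- chol(B)
print(all.equal(crossprod(R), B))
```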

If I understand correctly, I should simply not use this package, as this error is not easily fixable?

* brglm: the analysis runs. However, I get a warning message saying:
# In fit.proc(x = X, y = Y, weights = weights, start = start,
# etastart = etastart, :
# Iteration limit reached
As before, I cannot find where and why this function is used while
running the package, or whether it can be fixed by adjusting some
settings.

More generally, I was wondering what the fundamental differences
between these packages are.


You might also take a crack with bayesglm() in the arm package, which
should (?) be able to overcome the separation problem by specifying a
not-completely-uninformative prior.

Thank you for the tip. I will have a look at this package and its documentation tomorrow. Do you have any idea what this fit.proc function is?

I hope this makes sense, and I am sorry if this is something
statistically obvious that I am not aware of.

-----------------------------------------------------------------------

Here is an extract of my table and the different formulas I ran:

head (CPUE_table)

  Year Quarter Subarea Latitude Longitude Presence.S CPUE.S
1 2000       1    31F1    51.25       1.5          0      0
  Presence.H CPUE.H Presence.NP CPUE.NP Presence.BW CPUE.BW
1          0      0           0       0           0       0
  Presence.C CPUE.C Presence.P CPUE.P Presence.W   CPUE.W
1          1 76.002          0      0          1 3358.667


[snip]

logistf_binomPres<- logistf (Presence.S ~ (Presence.BW + Presence.W
+ Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW +
CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter +
Latitude + Longitude)^2, data = CPUE_table)

Brglm_binomPres<- brglm (Presence.S ~ (Presence.BW + Presence.W +
Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H
+ CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude +
Longitude)^2, family = binomial, data = CPUE_table)


It's not much to go on, but:


Yeah, sorry, my table header came out really badly in the email :s

* are you overfitting your data? That is, do you have at least 20
times as many 1's or 0's (whichever is rarer) as the number of
parameters you are trying to estimate?

I have 16 explanatory variables, and with the interactions we get to 136 parameters.


> length (which((CPUE_table)[,]== 0))
[1] 33466

> length (which((CPUE_table)[,]== 1))
[1] 17552

I assume this means overfitting is not a problem, is it?
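One caveat on the check above: `CPUE_table[,] == 0` appears to count zeros across every column of the table, whereas the rule of thumb refers to the rarer class of the response (Presence.S) only. A minimal sketch of that calculation follows. The variable names are taken from the formula in this thread, but the response vector `y` is invented for illustration, and the parameter count assumes every predictor enters as a single numeric column.

```r
# The 16 predictors used in the formula above.
vars <- c("Presence.BW", "Presence.W", "Presence.C", "Presence.NP",
          "Presence.P", "Presence.H", "CPUE.BW", "CPUE.H", "CPUE.P",
          "CPUE.NP", "CPUE.W", "CPUE.C", "Year", "Quarter",
          "Latitude", "Longitude")

# Same structure as the model: all main effects plus two-way interactions.
f <- as.formula(paste("~ (", paste(vars, collapse = " + "), ")^2"))

# 16 main effects + choose(16, 2) interactions, then add the intercept.
n_terms <- length(attr(terms(f), "term.labels"))
n_coef  <- n_terms + 1

# Hypothetical response; with the real data this would be CPUE_table$Presence.S.
y <- rep(c(0, 1), times = c(900, 100))
rarer <- min(table(y))   # size of the rarer class

print(n_terms)
print(n_coef)
print(rarer / n_coef)    # events per parameter; the rule of thumb asks for >= 20
```

If any predictor were a factor with more than two levels, the number of coefficients would be larger than this count.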

* have you examined your data graphically and looked for any strong
outliers that might be throwing off the fit?

I did check my data graphically in many different ways. However, if you have any particular suggestions, please let me know. Concerning strong outliers, I do not really understand what you mean. I have outliers here and there, but how can I know whether they are strong enough to throw off the fit? Most of the time they are very high abundances, which come from the fact that I am using survey data and are probably related to the boat having fished over a fish school.

* do you have some strongly correlated/multicollinear predictors?

It's survey data, so they are indeed correlated in time and space. However, I checked the scatterplot matrix and did not notice any linear relationship between variables.
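A scatterplot matrix only shows pairwise relationships, so it can miss exact linear dependencies among several columns of the design matrix. Comparing the rank of the model matrix with its number of columns is a more direct check. The sketch below uses made-up values and assumes that each Presence.X is simply an indicator of CPUE.X > 0, which the table extract suggests but does not confirm. Under that assumption the interaction Presence.W:CPUE.W is identical to CPUE.W itself.

```r
# Hypothetical abundance values; presence is derived as "abundance > 0".
cpue <- c(0, 0, 3.2, 0, 7.5, 1.1, 0, 4.4)
toy <- data.frame(CPUE.W = cpue,
                  Presence.W = as.integer(cpue > 0))

# Main effects plus their interaction, as in the (...)^2 formula.
X <- model.matrix(~ (Presence.W + CPUE.W)^2, data = toy)

# If the rank is smaller than the number of columns, some columns are
# exact linear combinations of others.
rank_X <- qr(X)$rank
print(ncol(X))
print(rank_X)

# Here the interaction column duplicates the CPUE.W column exactly.
print(all(X[, "Presence.W:CPUE.W"] == X[, "CPUE.W"]))
```

Running the same comparison on the full model matrix for CPUE_table would show whether the real design has this problem.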

* for what it's worth it looks like a variety of your variables
might be dummy variables, which you can often express more compactly
by using a factor variable and letting R construct the design matrix
(i.e. generating the dummy variables on the fly), although that
shouldn't change your results

I will read up on the concept of dummy variables because, to be honest, I don't really understand what it means...
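To make the idea concrete, here is a small sketch with an invented data frame. A dummy variable is a 0/1 column that marks membership in one category. When a categorical variable is stored as a factor, R builds those columns automatically in the design matrix.

```r
# Hypothetical data: Quarter stored as a number from 1 to 4.
toy <- data.frame(Quarter = c(1, 2, 3, 4, 1, 3))

# Treat Quarter as categorical rather than as a numeric trend.
toy$Quarter <- factor(toy$Quarter)

# model.matrix() expands the factor into 0/1 dummy columns.
# The first level (Quarter 1) is the baseline absorbed by the intercept.
X <- model.matrix(~ Quarter, data = toy)
print(colnames(X))
print(X)
```

Whether Quarter or Year should be treated as a factor or as a numeric covariate in the real model is a modelling decision that this sketch does not settle.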


Thank you again for your time and help

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


