Re: [R] Appropriate regression model for categorical variables

2007-06-13 Thread Moshe Olshansky
Tirtha wrote:

Dear users,
In my psychometric test I have applied logistic regression on my
data. My data consists of 50 predictors (22 continuous and 28
categorical) plus a binary response.

Using glm(), stepAIC() I didn't get a satisfactory result, as the
misclassification rate is too high. I think the categorical variables
are responsible for this debacle. Some of them have more than 6
levels (one has 10 levels).

Please suggest some better regression model for this situation. If
possible you can suggest some article.

Thanking you.

Tirtha


Hi Tirtha,

Are your categorical variables really categorical?
What I mean is: if your variable is the user's satisfaction level
(0 for very unsatisfied, 1 for moderately unsatisfied, 2 for slightly
unsatisfied, 4 for neutral, etc., up to 7 for very satisfied), then
it is not really categorical but ordinal (since 1 is closer to 3 than
to 6), and you can try what other people have suggested. However, if
your variable is, say, the 50th amino acid in a certain gene (with
values of 1 for the first amino acid, 2 for the second one, ..., 20
for the 20th one), then it really is categorical (you generally
cannot say that amino acid 2 is much closer to amino acid 3 than to
amino acid 17). In such a case I would try a classification method
that can handle categorical variables or, alternatively, regression
trees (i.e. split on the values of the categorical variables and, at
each node, fit regression coefficients for the continuous variables).
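
For instance, a rough sketch along those lines (the names dat, y,
satisfaction and amino50 are made up for illustration; substitute
your own data) would recode the variables accordingly and try a
classification tree from the rpart package, which handles factors
directly:

  library(rpart)   # recursive partitioning (classification trees)

  dat$y <- factor(dat$y)                                # binary response
  dat$satisfaction <- ordered(dat$satisfaction,
                              levels = 0:7)             # ordinal scale
  dat$amino50 <- factor(dat$amino50)                    # truly categorical

  fit  <- rpart(y ~ ., data = dat, method = "class")    # classification tree
  pred <- predict(fit, type = "class")
  mean(pred != dat$y)          # apparent (in-sample) misclassification rate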

Regards,

Moshe Olshansky
[EMAIL PROTECTED]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Appropriate regression model for categorical variables

2007-06-12 Thread Robert A LaBudde
At 01:45 PM 6/12/2007, Tirtha wrote:
Dear users,
In my psychometric test I have applied logistic regression on my data. My
data consists of 50 predictors (22 continuous and 28 categorical) plus a
binary response.

Using glm(), stepAIC() I didn't get a satisfactory result, as the
misclassification rate is too high. I think the categorical variables are
responsible for this debacle. Some of them have more than 6 levels (one has
10 levels).

Please suggest some better regression model for this situation. If possible
you can suggest some article.

1. If a factor has many levels, check whether there is a natural
order to the levels. If so, consider fitting the factor as an
ordered factor.

2. Break the factor levels into 2 or 3 groups that have some rational
connection, then fit the factor with this smaller number of levels.
E.g., race might have levels "white", "black", "asian", "pacific",
"Spanish surname", "other"; consider a change to "white", "nonwhite".
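
Both steps are straightforward in base R. As a rough sketch (with
made-up variable names; adjust to your own data):

  ## 1. declare an ordinal predictor as an ordered factor
  dat$satisfaction <- ordered(dat$satisfaction, levels = 0:7)

  ## 2. collapse a many-level factor into a coarser grouping
  dat$race2 <- factor(ifelse(dat$race == "white", "white", "nonwhite"))

  ## refit, dropping the original many-level factor
  fit <- glm(y ~ . - race, family = binomial, data = dat)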


Robert A. LaBudde, PhD, PAS, Dpl. ACAFS   e-mail: [EMAIL PROTECTED]
Least Cost Formulations, Ltd.             URL: http://lcfltd.com/
824 Timberlake Drive                      Tel: 757-467-0954
Virginia Beach, VA 23464-3239             Fax: 757-467-2947

Vere scire est per causas scire



Re: [R] Appropriate regression model for categorical variables

2007-06-12 Thread Ted Harding
On 12-Jun-07 17:45:44, Tirthadeep wrote:
 
 Dear users,
 In my psychometric test I have applied logistic regression
 on my data. My data consists of 50 predictors (22 continuous
 and 28 categorical) plus a binary response.

 Using glm(), stepAIC() I didn't get a satisfactory result, as
 the misclassification rate is too high. I think the categorical
 variables are responsible for this debacle. Some of them have
 more than 6 levels (one has 10 levels).

 Please suggest some better regression model for this situation.
 If possible you can suggest some article.

I hope you have a very large number of cases in your data!

The minimal complexity of the 28 categorical variables compatible
with your description is

  1 factor at 10 levels
  2 factors at 7 levels
 25 factors at 2 levels

which corresponds to (2^25)*(7^2)*10 = 16441671680 ~= 1.6e10
distinct possible combinations of levels of the factors. Your
actual factors may allow far more combinations than this.
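
(That count is easy to verify directly in R:

  2^25 * 7^2 * 10
  ## [1] 16441671680
)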

Unless you have more cases than this in your data, you are likely
to run into what is called complete (linear) separation, in which
the logistic regression will find a perfect predictor for your
binary outcome. This perfect predictor may well not be unique
(indeed, if you have only a few hundred cases, there will probably
be millions of them).

Therefore your logistic regression is likely to be meaningless.
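
One rough way to see whether this is happening (assuming your fitted
model object is called fit): with separated data, glm() typically
warns "fitted probabilities numerically 0 or 1 occurred", and the
coefficients and their standard errors blow up:

  range(fitted(fit))          # values essentially 0 or 1 suggest separation
  summary(fit)$coefficients   # look for huge estimates and standard errors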

I can only suggest that you consider very closely how to

a) reduce the number of levels in some of your factors
   by coalescing levels together;
b) define new factors in terms of the old ones so as to reduce
   the total number of factors (which may include dropping
   some factors altogether),

so that you end up with new categorical variables whose total
number of possible combinations is much smaller (say at most 1/5
of the number of cases in your data).
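
A quick way to compare that count of possible combinations with your
sample size (an illustrative sketch; dat stands for your data frame):

  facs <- names(dat)[sapply(dat, is.factor)]   # the categorical predictors
  prod(sapply(dat[facs], nlevels))             # possible level combinations
  nrow(dat)                                    # number of cases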

In summary: you have too many explanatory variables.

Best wishes,
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jun-07   Time: 00:23:49
-- XFMail --
