Re: [R] Discretize continous variables....

Frank E Harrell Jr Sat, 19 Jul 2008 18:20:23 -0700

milicic.marko wrote:

Frank/Daniel,


Thank you for the very good discussion on this.

The reason I'm doing this is that it is common industrial practice
to group a continuous variable (say age) into a couple of buckets while
developing scorecards to be used by business people. I don't see the
reason why I shouldn't discretize the variable AGE if I manage to maintain
the same information or reduce it only slightly.

However, I do agree that reading your book will be of great benefit.


Thanks a lot.... and keep the discussion alive

Thanks for your note. Categorizing age will adversely affect the
scorecard. First, since you are introducing discontinuities into the
prediction model, people can game the system to exploit the
discontinuity. Second, lost information from age will have to be made
up by adding another variable to the model that you might not have
needed had the full age variable been adjusted for. Third, if you chop
age into enough intervals to preserve the predictive value (hard to do
especially in the outer age ranges where sample sizes do not permit
cutting but where the age effect is sharp) you will find that the mean
squared error of predicted values is higher than if you treated age as a
continuous variable and just forced its effect to be smooth (e.g., using
a regression spline).


Frank
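
The mean squared error point above can be checked with a small
simulation. The sketch below is illustrative only: the sample size, the
age range, and the true age effect (log odds linear in age) are all
assumptions, not anything taken from this thread. It uses base R's glm()
with a natural spline from the splines package; rms users would write
the analogous continuous model as lrm(y ~ rcs(age, 4)).

```r
library(splines)  # ships with R; provides ns()

set.seed(1)
n      <- 2000
age    <- runif(n, 20, 80)                     # assumed age distribution
p_true <- plogis(-2 + 0.08 * (age - 50))       # assumed smooth true age effect
y      <- rbinom(n, 1, p_true)

# Scorecard-style version: age chopped into four quartile buckets
age_bin <- cut(age, breaks = quantile(age, 0:4 / 4), include.lowest = TRUE)
fit_bin <- glm(y ~ age_bin, family = binomial)

# Continuous version: age kept continuous, effect forced to be smooth
fit_spl <- glm(y ~ ns(age, df = 3), family = binomial)

# Mean squared error of predicted probabilities against the true ones
mse <- function(fit) mean((fitted(fit) - p_true)^2)
mse_binned <- mse(fit_bin)
mse_spline <- mse(fit_spl)
c(binned = mse_binned, spline = mse_spline)
```

Under these assumptions the spline fit should show the smaller error,
because a step function cannot follow an effect that keeps changing
within each bucket. A different seed or true effect changes the numbers.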






On Jul 19, 7:03 pm, Frank E Harrell Jr <[EMAIL PROTECTED]>
wrote:

Daniel Malter wrote:

True. Thanks for the clarification. Is your conclusion from that that the
findings in such a case should only be interpreted in the specific context
(with the awareness that it does not apply to changing contexts) or that
such an approach should not be taken at all?

The latter, in general;  in specific cases the former.  But even then
why condition on incomplete information when complete information is
available?  I.e., why compute Pr(Y=1 | X>x) in place of Pr(Y=1 | X=x)?

Frank

Frank E Harrell Jr wrote:

Daniel Malter wrote:

This time I agree with Rolf Turner. This sounds like homework. Whether or
not, type
?ifelse
in the R-prompt.
Frank is right, it leads to a loss in information. However, I think it
remains interpretable. Further, it is common practice in certain fields,
and

I have to disagree.  It is easy to show that odds ratios so obtained are
functions of the entire distribution of the predictor in question.  Thus
they do not estimate a scientific quantity (something that can be
interpreted out of context).  For example if age is cut at 65 and one
were to add to the sample several subjects aged 100, the >=65 : <65 odds
ratio would change even if the age effect did not.
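
A rough numerical sketch of the age-65 example, using expected event
rates rather than simulated outcomes so that nothing depends on random
draws. The true model (intercept -4, slope 0.05 per year on the log odds
scale) and the two samples of ages are hypothetical values chosen only
for illustration.

```r
# Assumed true model: the log odds of Y = 1 rise linearly with age
p <- function(age) plogis(-4 + 0.05 * age)

# Odds ratio for >=65 vs <65, computed from the expected event rate
# in each group under the assumed model
or_cut65 <- function(age) {
  old  <- age >= 65
  odds <- function(pr) pr / (1 - pr)
  odds(mean(p(age[old]))) / odds(mean(p(age[!old])))
}

age_orig  <- seq(30, 90, by = 1)         # hypothetical sample of ages
age_extra <- c(age_orig, rep(100, 20))   # same sample plus 20 subjects aged 100

or_orig  <- or_cut65(age_orig)
or_extra <- or_cut65(age_extra)
c(original = or_orig, with_100s = or_extra)

# The per-year odds ratio from the continuous model is identical in both
# samples, because the age effect itself was never changed
exp(0.05)
```

The dichotomized odds ratio moves when the subjects aged 100 are added,
although the function p() that generates the probabilities is untouched.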

it may be a reasonable way to check whether mostly outliers in the X drive
your results (although other approaches are available for that as well).
The main underlying question, however, should be: do you have reason to
expect that the response differs by the groups you create rather than by
the values of the continuous variable?

Regression splines can help.  Sometimes the splines are stated in terms
of the cube root of the predictor to avoid excess influence.
Frank
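
One possible way to state a spline in terms of the cube root of a
predictor, sketched with base R's glm() and splines::ns(). The skewed
predictor x, its distribution, and the true effect are made up for the
example; the only idea taken from the message above is fitting the
spline to x^(1/3) instead of x.

```r
library(splines)  # ships with R; provides ns()

set.seed(2)
n <- 1000
x <- rexp(n, rate = 1 / 50)                      # hypothetical right-skewed predictor
y <- rbinom(n, 1, plogis(-1 + 0.4 * x^(1/3)))    # assumed true effect

# Spline on the raw scale
fit_raw  <- glm(y ~ ns(x, df = 3), family = binomial)

# Spline stated in terms of the cube root of the predictor
fit_root <- glm(y ~ ns(x^(1/3), df = 3), family = binomial)

# One way to look at the influence of extreme x values: largest leverage
c(raw = max(hatvalues(fit_raw)), cube_root = max(hatvalues(fit_root)))

# Predictions are still requested on the original scale of x
pred <- predict(fit_root, newdata = data.frame(x = c(1, 10, 100, 300)),
                type = "response")
pred
```

The transformation lives inside the model formula, so predict() applies
it automatically and users of the model never handle x^(1/3) directly.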

Regarding question 2: I thought you meant that you want to reduce the
number of levels (say 4) to a smaller number of levels (say 2) for one of
your independent variables (i.e. one of the Xs), not Y. This makes sense
only if there is a good conceptual reason to group these categories, not
just to get significance.
Best,
Daniel
Frank E Harrell Jr wrote:

milicic.marko wrote:

Hi R helpers,
I'm preparing a dataset to fit a logistic regression model with lrm(). I
have various continuous and discrete variables and I would like to:
1. Optimally discretize continuous variables (optimally meaning maximizing
information value, IV, for example)

This will result in effects in the model that cannot be interpreted and
will ruin the statistical inference from the lrm.  It will also hurt
predictive discrimination.  You seem to be allergic to continuous
variables.

2. Regroup discrete variables to achieve perhaps a smaller number of
levels and better information value...

If you use the Y variable to do this the same problems will result.
Shrinkage is a better approach, or using marginal frequencies to combine
levels.  See the "pre-specification of complexity" strategy in my book
Regression Modeling Strategies.
Frank
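
A minimal base R sketch of combining levels by marginal frequency, that
is, without ever looking at Y. The helper pool_rare(), the 10% threshold,
and the example factor are hypothetical; the threshold would need to be
fixed before seeing the response.

```r
# Hypothetical categorical predictor with some rare levels
x <- factor(c(rep("A", 50), rep("B", 30), rep("C", 12),
              rep("D", 5),  rep("E", 3)))

# Pool levels whose marginal relative frequency is below a pre-specified
# threshold; the response Y is never consulted
pool_rare <- function(f, min_prop = 0.10, other = "Other") {
  prop <- table(f) / length(f)
  rare <- names(prop)[prop < min_prop]
  # Giving several levels the same name merges them into one level
  levels(f)[levels(f) %in% rare] <- other
  f
}

x_pooled <- pool_rare(x)
table(x_pooled)
```

Here D and E (5% and 3% of the observations) fall below the threshold and
are merged into a single "Other" level, while A, B, and C are kept as is.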

Please suggest whether there is some package providing this or similar
functionality for discretization...
If there is no package, please suggest how to achieve this.



--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
