Date: 5 JUN 2002 22:36:06 -0700
From: Sapsi <[EMAIL PROTECTED]>
> We are doing a regression; however, there are plenty of missing values,
> which cannot be dropped, so they somehow have to be imputed.
> There are some options:
> a) Replace the problem regressor by a dummy variable
> b) Replace the missing by median (it is demographic data)
> c) Replace the missing data by the trimmed/winsorized mean.
> d) Replace by mode
>
> Which is the better alternative?
It depends on the joint distribution function h(x,y).
You can get some insight from the (x,y) scatter plot and the normalized
marginal histogram f(xi) (i.e., 1 = SUM(i=1,N){f(xi)} ).
If the marginal is multimodal or weird in some other way, one of
the above choices probably won't work.
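As a rough illustration of the normalized marginal histogram, here is a minimal Python sketch. It assumes NumPy, assumes the missing entries of the regressor are coded as NaN, and uses a hypothetical helper name (normalized_histogram) and made-up data; none of these come from the original question.

```python
import numpy as np

def normalized_histogram(x, bins=10):
    """Return bin centers xi and relative frequencies f(xi) that sum to 1.

    Assumption: missing values in x are coded as NaN and are ignored here.
    """
    x = np.asarray(x, dtype=float)
    observed = x[~np.isnan(x)]                      # keep only observed values
    counts, edges = np.histogram(observed, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])        # midpoint of each bin
    freqs = counts / counts.sum()                   # normalize so SUM f(xi) = 1
    return centers, freqs

# Made-up example: a regressor with two clusters and two missing entries.
x = np.array([1.0, 2.0, np.nan, 2.5, 8.0, np.nan, 9.0, 8.5])
centers, freqs = normalized_histogram(x, bins=4)
for c, f in zip(centers, freqs):
    print(f"bin center {c:.2f}: f = {f:.3f}")
```

Bins with large f(xi) separated by empty or nearly empty bins are one sign of a multimodal marginal.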
If the missing values appear to be uncorrelated with y:
1. Reduce the number of bins in the histogram until you reach the smallest
number of bins, m, that you still consider representative.
2. Use the m representative bin values, each substituted in turn for the
missing values, to create m regressions.
3. Average over the m sets of regression coefficients with weights f(xi) to
obtain the final result.
I don't know if this will work if the missing values are correlated with y.
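The three steps above can be sketched in Python as follows. This is only one reading of the procedure, under several assumptions that are not in the original post: a single regressor x with missing values coded as NaN, an ordinary least-squares fit with intercept, bin midpoints as the "representative values", and a hypothetical function name (averaged_imputation_fit). The data are simulated.

```python
import numpy as np

def averaged_imputation_fit(x, y, m=3):
    """Fit m regressions, one per representative x value, and average them.

    Assumptions: missing x values are NaN; y is fully observed; the
    representative values are the midpoints of an m-bin histogram of the
    observed x; the weights are the normalized bin frequencies f(xi).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(x)

    # Step 1: coarse histogram of the observed values with m bins.
    counts, edges = np.histogram(x[~missing], bins=m)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = counts / counts.sum()                 # f(xi), sums to 1

    # Step 2: one regression per representative value.
    coefs = []
    for c in centers:
        x_filled = np.where(missing, c, x)          # impute every gap with c
        X = np.column_stack([np.ones_like(x_filled), x_filled])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta)

    # Step 3: average the m coefficient vectors with weights f(xi).
    return weights @ np.array(coefs)

# Simulated example: y = 1 + 2x + noise, with about 20% of x missing at random.
rng = np.random.default_rng(0)
x_true = rng.normal(size=200)
y = 1.0 + 2.0 * x_true + 0.1 * rng.normal(size=200)
x_obs = np.where(rng.random(200) < 0.2, np.nan, x_true)
print(averaged_imputation_fit(x_obs, y, m=3))       # [intercept, slope]
```

Here the values are missing independently of y, which is the case the steps are meant for; the sketch says nothing about what happens when missingness is correlated with y.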
Hope this helps.
Greg
Gregory E. Heath [EMAIL PROTECTED]
M.I.T. Lincoln Lab, Lexington, MA 02420-9185, USA
(781) 981-2815, (781) 981-0908 (FAX)
The views expressed here are not necessarily shared by M.I.T./LL or its sponsors.