Rich Ulrich <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> On 27 Feb 2002 11:59:53 -0800, [EMAIL PROTECTED] (Brad Anderson)
> wrote:
>
> > I have a continuous response variable that ranges from 0 to 750. I
> > only have 90 observations and 26 are at the lower limit of 0, which is
> > the modal category. The mean is about 60 and the median is 3; the
> > distribution is highly skewed, extremely kurtotic, etc. Obviously,
> > none of the power transformations are especially useful. The product
>
> I guess it is 'continuous' except for having 26 ties at 0.
> I have to wonder how that set of scores arose, and also,
> what should a person guess about the *error* associated
> with those: Are the numbers near 750 measured with
> as much accuracy as the numbers near 3?
I should have been more precise. It's technically a count variable
representing the number of times respondents report using dirty
needles/syringes after someone else had used them during the past 90
days. Subjects were first asked to report the number of days they had
injected drugs, then the average number of times they injected on
injection days, and finally, on how many of those total times they had
used dirty needles/syringes. All of the subjects are injection drug
users, but not all use dirty needles. The reliability of reports near 0
is likely much better than the reliability of estimates near 750.
Indeed, substantively, the difference between a 0 and a 1 is much more
significant than the difference between a 749 and a 750: 0 represents no
risk, 1 represents at least some risk, and high values, regardless of
their precision, represent high risk.

> How do zero scores arise? Is this truncation; the limit of
> practical measurement; or just what?

Zero scores are logical and represent no risk; negative values are not
logical.

> "Extremely kurtotic," you say. That huge lump at 0 and skew
> is not consistent with what I think of as kurtosis, but I guess
> I have not paid attention to kurtosis at all, once I know that
> skewness is extraordinary.

True. The kurtosis statistic exceeded 11, and a plot against the normal
indicates a huge lump in the low end of the tail, and also a larger
proportion of very high values than expected.

> Categorizing the values into a few categories labeled,
> "none, almost none, ...." is one way to convert your scores.
> If those labels do make sense.

Makes sense at the low end: 0 = no risk. At the high end I used 90+,
representing use of a dirty needle/syringe once a day or more often. The
two middle categories were pretty arbitrary.
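For what it's worth, the recoding into four ordered categories is easy to
script. Here is a minimal numpy sketch; the middle cutpoints (1 and 10)
are illustrative assumptions on my part, since the actual middle bins were
arbitrary, and only the 0 and 90+ boundaries come from the description
above:

```python
import numpy as np

# Hypothetical counts of dirty-needle uses over 90 days (range 0-750).
counts = np.array([0, 0, 1, 3, 12, 45, 90, 200, 750])

# Four ordered categories: 0 = none, two (arbitrary) middle bins, and
# 90+ = dirty-needle use once a day or more on average.  The cutpoints
# 1 and 10 are illustrative, not the ones used in the actual analysis.
bins = [1, 10, 90]  # right-open edges: [0,1), [1,10), [10,90), [90,inf)
category = np.digitize(counts, bins)
print(category)     # codes 0 (none) through 3 (90+)
```

`np.digitize` assigns each count the index of the bin it falls in, which
gives exactly the kind of 4-level ordinal response used in the contingency
table analysis below.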
If I analyze a contingency table using the 4-category response and a
3-category measure of the primary covariate (categories defined using
"clinically meaningful" cutpoints), the association is quite strong. I
used the exact p-value associated with the CMH difference in row means
test (using SAS), and the association is significant. I also used the
3-category predictor and the procedures outlined by Stokes et al. (2000)
to estimate a rank analysis of covariance--again with consistent
results.

I've also run a few other analyses I didn't describe. I used the Box-Cox
procedure to find a power transformation. Although the skewness
statistic then looks great, the distribution is still not approximately
normal. However, a regression using the transformed variable is
consistent with the ordered logit and the contingency table analysis.

One of the other posters asked about the appropriate error term--I guess
that lies at the heart of my inquiry. I have no idea what the
appropriate error term would be, or how best to model such data. I often
deal with similar response variables whose distributions have
observations clustered at one or both ends of the continuum. In most
cases, these distributions are not even approximately unimodal and
mildly skewed--that is, they are not the kind of variables for which
normalizing power transformations make sense. Additionally, these
typically aren't outcomes that could be thought of as being generated by
a Gaussian process.

In some cases I think it makes sense to consider Poisson processes and
generalizations of them, although there is clearly much greater
between-subject heterogeneity than a Poisson process assumes. I
estimated Poisson and negative binomial regression models--there was
compelling evidence that the Poisson was overdispersed. I also used a
Vuong statistic to compare NB regression with zero-inflated NB
regression--the results support the zero-inflated model.
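To make the overdispersion point concrete, here is a self-contained numpy
sketch (simulated data, not the needle data): counts are drawn from a
negative binomial, a Poisson regression is fit by iteratively reweighted
least squares, and the Pearson chi-square divided by its residual degrees
of freedom is computed. That ratio should be near 1 if the Poisson
variance assumption holds, and well above 1 under overdispersion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 90 subjects, one covariate, counts drawn from a
# negative binomial so the truth is overdispersed relative to Poisson.
n = 90
x = rng.normal(size=n)
mu_true = np.exp(0.5 + 0.8 * x)
# NB with r=1 and p=1/(1+mu) has mean mu and variance mu*(1+mu).
y = rng.negative_binomial(1.0, 1.0 / (1.0 + mu_true))

# Poisson regression fit by iteratively reweighted least squares.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(50):
    eta = X @ beta
    w = np.exp(eta)                  # Poisson: variance equals mean
    z = eta + (y - w) / w            # working response
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

# Overdispersion check: Pearson chi-square / residual df.
fitted = np.exp(X @ beta)
dispersion = np.sum((y - fitted) ** 2 / fitted) / (n - X.shape[1])
print(f"estimated dispersion: {dispersion:.2f}")
```

With this simulated design the dispersion statistic lands well above 1,
which is the same symptom that motivates moving from Poisson to negative
binomial (and then to zero-inflated) models.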
The model-based standard errors for a zero-inflated model are wildly
different from the Huber-White sandwich robust standard errors. The
latter give results that are fairly consistent with the ordered logit;
the model-based standard errors are huge. Given that these are
asymptotic statistics and I have a relatively small sample, I don't
really trust either.

I think a lot of folks just run standard analyses or arbitrarily apply
some "normalizing" transformation because that's what's done in their
field, then report the results without really examining the underlying
distributions. I'm curious how folks proceed when they encounter very
goofy distributions.

Thanks for your comments.

=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
http://jse.stat.ncsu.edu/
=================================================================
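As a footnote on the model-based versus sandwich comparison: the two can
be computed side by side even in a hand-rolled fit. The sketch below does
this for a plain Poisson MLE on simulated overdispersed data (an
assumption on my part; it is not the zero-inflated model, just the same
bread/meat construction applied to a simpler likelihood). The bread is
the information matrix, the meat is the empirical variance of the scores:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated overdispersed counts (negative binomial, mean mu_true).
n = 90
x = rng.normal(size=n)
mu_true = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(1.0, 1.0 / (1.0 + mu_true))

# Fit the (misspecified) Poisson MLE by IRLS.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)
    z = X @ beta + (y - mu) / mu
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))

mu = np.exp(X @ beta)
A = X.T @ (mu[:, None] * X)                # information ("bread")
B = X.T @ (((y - mu) ** 2)[:, None] * X)   # empirical score variance ("meat")
cov_model = np.linalg.inv(A)               # model-based covariance
cov_sandwich = np.linalg.inv(A) @ B @ np.linalg.inv(A)

se_model = np.sqrt(np.diag(cov_model))
se_robust = np.sqrt(np.diag(cov_sandwich))
print("model-based SEs:", se_model)
print("sandwich SEs:   ", se_robust)
```

When the working model understates the true variance, as here, the
sandwich standard errors come out larger than the model-based ones; in
the zero-inflated case described above the discrepancy ran the other way,
which is part of what makes it hard to trust either at n = 90.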