Rich Ulrich <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> On 27 Feb 2002 11:59:53 -0800, [EMAIL PROTECTED] (Brad Anderson)
> wrote:
>
> > I have a continuous response variable that ranges from 0 to 750. I
> > only have 90 observations and 26 are at the lower limit of 0, which is
> > the modal category. The mean is about 60 and the median is 3; the
> > distribution is highly skewed, extremely kurtotic, etc. Obviously,
> > none of the power transformations are especially useful. The product
>
> I guess it is 'continuous' except for having 26 ties at 0.
> I have to wonder how that set of scores arose, and also,
> what should a person guess about the *error* associated
> with those: Are the numbers near 750 measured with
> as much accuracy as the numbers near 3?
I should have been more precise. It's technically a count variable
representing the number of times respondents report using dirty
needles/syringes after someone else had used them during the past 90
days. Subjects were first asked to report the number of days they had
injected drugs, then the average number of times they injected on
injection days, and finally, on how many of those total times they had
used dirty needles/syringes. All of the subjects are injection drug
users, but not all use dirty needles. The reliability of reports near 0
is likely much better than the reliability of estimates near 750.
Indeed, substantively, the difference between a 0 and a 1 is much more
significant than the difference between a 749 and a 750: 0 represents no
risk, 1 represents at least some risk, and high values, regardless of
their precision, represent high risk.

> How do zero scores arise? Is this truncation; the limit of
> practical measurement; or just what?

Zero scores are logical and represent no risk; negative values are not
logical.

> "Extremely kurtotic," you say. That huge lump at 0 and skew
> is not consistent with what I think of as kurtosis, but I guess
> I have not paid attention to kurtosis at all, once I know that
> skewness is extraordinary.

True. The kurtosis statistic exceeded 11, and a plot against the normal
indicates a huge lump in the low end of the tail, and also a larger
proportion of very high values than expected.

> Categorizing the values into a few categories labeled,
> "none, almost none, ...." is one way to convert your scores.
> If those labels do make sense.

Makes sense at the low end: 0 = no risk. At the high end I used 90+,
representing use of a dirty needle/syringe once a day or more often. The
two middle categories were pretty arbitrary.
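For what it's worth, the recoding into four ordered categories is easy to
script. Here is a minimal numpy sketch; the middle cutpoints (1 and 10)
are illustrative assumptions on my part, since the actual middle bins were
arbitrary, and only the 0 and 90+ boundaries come from the description
above:

```python
import numpy as np

# Hypothetical counts of dirty-needle uses over 90 days (range 0-750).
counts = np.array([0, 0, 1, 3, 12, 45, 90, 200, 750])

# Four ordered categories: 0 = none, two (arbitrary) middle bins, and
# 90+ = dirty-needle use once a day or more on average.  The cutpoints
# 1 and 10 are illustrative, not the ones used in the actual analysis.
bins = [1, 10, 90]  # right-open edges: [0,1), [1,10), [10,90), [90,inf)
category = np.digitize(counts, bins)
print(category)     # codes 0 (none) through 3 (90+)
```

`np.digitize` assigns each count the index of the bin it falls in, which
gives exactly the kind of 4-level ordinal response used in the contingency
table analysis below.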
If I analyze a contingency table using the 4-category response and a
3-category measure of the primary covariate (categories defined using
"clinically meaningful" cutpoints), the association is quite strong. I
used the exact p-value associated with the CMH difference in row means
test (using SAS), and the association is significant. I also used the
3-category predictor and the procedures outlined by Stokes et al. (2000)
to estimate a rank analysis of covariance--again with consistent
results.

I've also run a few other analyses I didn't describe. I used the Box-Cox
procedure to find a power transformation. Although the skewness
statistic then looks great, the distribution is still not approximately
normal. However, a regression using the transformed variable is
consistent with the ordered logit and the contingency table analysis.

One of the other posters asked about the appropriate error term--I guess
that lies at the heart of my inquiry. I have no idea what the
appropriate error term would be, or how best to model such data. I often
deal with similar response variables whose distributions have
observations clustered at one or both ends of the continuum. In most
cases, these distributions are not even approximately unimodal and
mildly skewed--that is, they are not the kind of variables for which
normalizing power transformations make sense. Additionally, these
typically aren't outcomes that could be thought of as being generated by
a Gaussian process.

In some cases I think it makes sense to consider Poisson processes and
generalizations of them, although there is clearly much greater
between-subject heterogeneity than a Poisson process assumes. I
estimated Poisson and negative binomial regression models--there was
compelling evidence that the Poisson was overdispersed. I also used a
Vuong statistic to compare NB regression with zero-inflated NB
regression--the results support the zero-inflated model.
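To make the overdispersion point concrete, here is a self-contained numpy
sketch (simulated data, not the needle data): counts are drawn from a
negative binomial, a Poisson regression is fit by iteratively reweighted
least squares, and the Pearson chi-square divided by its residual degrees
of freedom is computed. That ratio should be near 1 if the Poisson
variance assumption holds, and well above 1 under overdispersion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 90 subjects, one covariate, counts drawn from a
# negative binomial so the truth is overdispersed relative to Poisson.
n = 90
x = rng.normal(size=n)
mu_true = np.exp(0.5 + 0.8 * x)
# NB with r=1 and p=1/(1+mu) has mean mu and variance mu*(1+mu).
y = rng.negative_binomial(1.0, 1.0 / (1.0 + mu_true))

# Poisson regression fit by iteratively reweighted least squares.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(50):
    eta = X @ beta
    w = np.exp(eta)                  # Poisson: variance equals mean
    z = eta + (y - w) / w            # working response
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

# Overdispersion check: Pearson chi-square / residual df.
fitted = np.exp(X @ beta)
dispersion = np.sum((y - fitted) ** 2 / fitted) / (n - X.shape[1])
print(f"estimated dispersion: {dispersion:.2f}")
```

With this simulated design the dispersion statistic lands well above 1,
which is the same symptom that motivates moving from Poisson to negative
binomial (and then to zero-inflated) models.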
The model-based standard errors for a zero-inflated model are wildly
different from the Huber-White sandwich robust standard errors. The
latter give results that are fairly consistent with the ordered logit;
the model-based standard errors are huge. Given that these are
asymptotic statistics and I have a relatively small sample, I don't
really trust either.

I think a lot of folks just run standard analyses or arbitrarily apply
some "normalizing" transformation because that's what's done in their
field, then report the results without really examining the underlying
distributions. I'm curious how folks proceed when they encounter very
goofy distributions.

Thanks for your comments.

=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
http://jse.stat.ncsu.edu/
=================================================================
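As a footnote on the model-based versus sandwich comparison: the two can
be computed side by side even in a hand-rolled fit. The sketch below does
this for a plain Poisson MLE on simulated overdispersed data (an
assumption on my part; it is not the zero-inflated model, just the same
bread/meat construction applied to a simpler likelihood). The bread is
the information matrix, the meat is the empirical variance of the scores:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated overdispersed counts (negative binomial, mean mu_true).
n = 90
x = rng.normal(size=n)
mu_true = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(1.0, 1.0 / (1.0 + mu_true))

# Fit the (misspecified) Poisson MLE by IRLS.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)
    z = X @ beta + (y - mu) / mu
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))

mu = np.exp(X @ beta)
A = X.T @ (mu[:, None] * X)                # information ("bread")
B = X.T @ (((y - mu) ** 2)[:, None] * X)   # empirical score variance ("meat")
cov_model = np.linalg.inv(A)               # model-based covariance
cov_sandwich = np.linalg.inv(A) @ B @ np.linalg.inv(A)

se_model = np.sqrt(np.diag(cov_model))
se_robust = np.sqrt(np.diag(cov_sandwich))
print("model-based SEs:", se_model)
print("sandwich SEs:   ", se_robust)
```

When the working model understates the true variance, as here, the
sandwich standard errors come out larger than the model-based ones; in
the zero-inflated case described above the discrepancy ran the other way,
which is part of what makes it hard to trust either at n = 90.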