Roger,

It really depends a lot on the problem at hand. You usually need to decide how to treat the data in terms of some sort of model of the expected distribution of the results.
You can calculate a median from sample data, and then you need to assess how different its value might have been with other samples from the same problem space. Getting rid of outliers is similar to computing a confidence interval that truncates long tails. Measurements are never distributed in a perfect bell-shaped normal (Gaussian) curve. One way to think about outliers is to consider that the data comes from a mixture of more than one distribution: there is contamination.

Standard deviation is one of the usual things to estimate, but first you need some criterion for deciding whether the data points in your sample are significant. That can only come from knowing what the data is and, if it is not complete population data, how large the sample is and how much variation is expected from the sampling process.

Response times are a good example of a case where there is a cluster of values centred around a median or mean but with very stretched tails, particularly on one side. If your data was a list of response times, the value of 161241 is someone asleep at the switch, and other values might involve distraction. If it is machine response time, there might be something important in the distribution of the outliers.

So you might have a series of observations where the long straggling tails impart so much variance that the sample mean is imprecise. For a nice normal distribution, by contrast, the mean is the efficient estimate and the median gives only 2/Pi (about 64%) of the information about location.

One strategy is to do "running" calculations that include some number of adjacent values and expand outward until you get a stable value for the median. Use any odd number; a common start is the three numbers at the midpoint of the list. For example, if there are 20 values, include first the 9th, 10th and 11th numbers in the ordered list. Then you use strategies for smoothing.
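To make the "running" median idea concrete, here is a minimal Python sketch. It is only one reading of the strategy: a sliding window of odd width moved along the values in the order they were observed, which is the usual running-median smoother, not the expanding window on the ordered list. The name running_median and its width parameter are made up for illustration, and copying the end points unchanged is an assumption.

```python
from statistics import median

def running_median(values, width=3):
    """Replace each interior value by the median of the window centred on it.

    `width` must be odd. The first and last `width // 2` values are
    copied unchanged (an assumption; other end rules exist).
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    smoothed = list(values)
    for i in range(half, len(values) - half):
        smoothed[i] = median(values[i - half:i + half + 1])
    return smoothed

# A short stretch of the readings from the original post, in observed order.
sample = [10, 11, 10, 74, 11, 11, 14, 11]
print(running_median(sample))
```

A window of three removes an isolated spike such as 74, but a run of adjacent wild values (161241 49 32 in the full data) is wider than the window can absorb, so a larger odd width or a second pass would be needed there.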
I think you might find examples of this online or in texts using some of the words below as search terms. You might know the curve you are looking for; the example you give would be a flat curve along the horizontal axis. If it is actually continuous data, there might be an underlying function, and you could do some regression analysis.

trends
running medians
smoothed plots
long tails
etc.

Donna
[email protected]

On 2012-01-09, at 11:41 PM, Roger Hui wrote:

> Thanks. What's a reasonable multiple to use?
>
>
>
> On Mon, Jan 9, 2012 at 7:41 PM, Brian Schott <[email protected]> wrote:
>
>> John Tukey has studied outliers extensively in his interactive data
>> analysis. He computes a box plot by measuring the IQR, that's
>> interquartile range, of the data set. He adds and subtracts a multiple
>> of the IQR to the upper and lower quartiles of the box in the boxplot.
>> Data values outside the "hinges" (in Tukey speak) are outliers.
>>
>> The code below is from Donald R. McNeil's IDA, A Practical Primer.
>>
>> http://www.pixentral.com/show.php?picture=1Fnz2FOWX9nuYzndC9GbDbi2z1yz50
>>
>>
>> ---
>> (B=)
>>
>> On Jan 9, 2012, at 7:49 PM, Roger Hui <[email protected]> wrote:
>>
>>> I wonder if there are well-known techniques in statistics for dealing with
>>> the following problem.
>>>
>>>    t
>>> 11 10 10 10 10 11 10 10 10 10 9 11 10 11 10 10 11 10 11 10 11 10 10
>>> 11 10 11 10 10 10 11 10 74 11 11 14 11 11 10 12 11 15 14 12 11
>>> 11 11 11 11 10 12 11 11 11 10 11 11 11 10 11 11 10 11 161241 49
>>> 32 12 11 11 12 10 11 10 12 11 12 11 11 12 11 11 12 11 11 11 12
>>> 11 11 12 11 11 11 11 11 11 11 10 11 11 12 12
>>>
>>> t is a set of samples from a noisy source which is supposed to give the
>>> same integer answer. Obviously, 161241 is an "outlier", and it is likely
>>> that 74, 49, or even 32 are outliers too. Are there standard techniques
>>> for discarding outliers to clean up the data, before the application of
>>> statistical tests such as the means test or large sample test?
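The box plot rule Brian describes can be sketched in Python as below. This is an illustration under stated assumptions, not the code from McNeil's book: the cutoffs (usually called fences; the hinges are the quartiles themselves) are placed a multiple k of the IQR beyond the quartiles, with k = 1.5 as the conventional default. The names tukey_fences and split_outliers are made up here, and the quartiles use the "inclusive" method of statistics.quantiles, which is only one of several quartile conventions.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return (low, high) cutoffs placed k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, k=1.5):
    """Split values into (kept, dropped) using the fences above."""
    low, high = tukey_fences(values, k)
    kept = [v for v in values if low <= v <= high]
    dropped = [v for v in values if v < low or v > high]
    return kept, dropped

# The 100 readings t from the original post.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10,
    11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15,
    14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11,
    161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11,
    11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]

kept, dropped = split_outliers(t)
print(tukey_fences(t))
print(sorted(dropped))
```

Because the quartiles of t are 10 and 11, the IQR is only 1 and the k = 1.5 rule also flags the readings of 14 and 15 along with the obvious ones. A larger multiple such as 3 (the usual "far out" cutoff) is less aggressive, which is why the choice of k depends on knowing the data.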
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
