On the reliability of data: Question to a mathematician: What is two plus two? He responds: Four.
To an Engineer: What is two plus two? He responds: Three point nine nine (pause with a lot of thought) nine.
To a Geologist: What is two plus two? He responds: Somewhere between three and five.
To a Geophysicist: What is two plus two? He responds, whispering in your ear: What would you like it to be?

On Wed, Jan 11, 2012 at 5:09 PM, Donna Y <[email protected]> wrote:

> Roger
>
> It really depends a lot on the problem at hand. You need to decide the
> treatment of the data, usually in terms of some sort of model of the
> expected distribution of the results.
>
> You can calculate a median from sample data, and then you need to assess
> how different its value might have been with other samples from the same
> problem space. Getting rid of outliers is similar to computing a
> confidence interval that truncates long tails.
>
> Measurements are never exactly distributed in magic bell-shaped (normal,
> Gaussian) curves, or in any other textbook distribution.
>
> One way to think about outliers is to consider that the data comes from a
> mixture of more than one distribution - there is contamination.
>
> Standard deviation is one of the usual things to estimate, but first you
> need some criteria to decide whether the data points in your sample are
> significant or not. That can only come from knowing what the data is and,
> if it is not complete population data, how large the sample is and how
> much variation is expected from the sampling process.
>
> You can imagine a case where there might be a cluster of values centred
> around a median or mean but with very stretched tails, particularly on one
> side, by considering response times. If your data were a list of response
> times, the value of 161241 is someone asleep at the switch, and other
> values might involve distraction. Or, if it is machine response time,
> there might be something important in the distribution of the outliers.
>
> So, you might have a series of observations, but the long straggling tails
> impart so much variance that the sample mean is less precise than the
> median. Otherwise, for a nice normal distribution, the median gives only
> 2/Pi of the information about location that the mean does.
>
> One strategy is to do "running" calculations, including some number of
> adjacent values and expanding outward until you get a stable value for the
> median. Use any odd number, but often you begin with the three numbers at
> the midpoint of the list; for example, if there are 20 values, include
> first the 9th, 10th and 11th numbers in the ordered list. Then you use
> strategies for smoothing.
>
> I think you might find examples of this online or in texts using some of
> these words for search terms.
>
> You might know the curve you are looking for - the example you give would
> be a flat curve along the horizontal axis.
>
> If it is actually continuous data, there might be an underlying function,
> and you could do some regression analysis.
>
> trends
> running medians
> smoothed plots
> long tails
> etc
>
> Donna
> [email protected]
>
>
> On 2012-01-09, at 11:41 PM, Roger Hui wrote:
>
> > Thanks. What's a reasonable multiple to use?
> >
> > On Mon, Jan 9, 2012 at 7:41 PM, Brian Schott <[email protected]> wrote:
> >
> >> John Tukey has studied outliers extensively in his exploratory data
> >> analysis. He computes a box plot by measuring the IQR, that's
> >> interquartile range, of the data set. He adds a multiple of the IQR
> >> to the upper quartile of the box in the boxplot and subtracts the same
> >> multiple from the lower quartile. Data values outside these "fences"
> >> (in Tukey speak; the quartiles themselves are the "hinges") are
> >> outliers.
> >>
> >> The code below is from Donald R. McNeil's IDA, A Practical Primer.
> >>
> >> http://www.pixentral.com/show.php?picture=1Fnz2FOWX9nuYzndC9GbDbi2z1yz50
> >>
> >> ---
> >> (B=)
> >>
> >> On Jan 9, 2012, at 7:49 PM, Roger Hui <[email protected]> wrote:
> >>
> >>> I wonder if there are well-known techniques in statistics for dealing
> >>> with the following problem.
> >>>
> >>>    t
> >>> 11 10 10 10 10 11 10 10 10 10 9 11 10 11 10 10 11 10 11 10 11 10 10
> >>> 11 10 11 10 10 10 11 10 74 11 11 14 11 11 10 12 11 15 14 12 11
> >>> 11 11 11 11 10 12 11 11 11 10 11 11 11 10 11 11 10 11 161241 49
> >>> 32 12 11 11 12 10 11 10 12 11 12 11 11 12 11 11 12 11 11 11 12
> >>> 11 11 12 11 11 11 11 11 11 11 10 11 11 12 12
> >>>
> >>> t is a set of samples from a noisy source which is supposed to give the
> >>> same integer answer. Obviously, 161241 is an "outlier", and it is
> >>> likely that 74, 49, or even 32 are outliers too. Are there standard
> >>> techniques for discarding outliers to clean up the data, before the
> >>> application of statistical tests such as the means test or large
> >>> sample test?

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
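
For anyone who wants to try the IQR rule from Brian's message on the data in t, here is a minimal sketch in Python. It is not McNeil's code from the linked image. The function names tukey_fences and split_outliers are made up for illustration, and the quartiles come from Python's statistics.quantiles, whose convention can differ slightly from Tukey's hinges on small samples. The multiple k = 1.5 is the conventional choice for boxplots, and k = 3 is often used for "far out" values; which one is reasonable still depends on the problem at hand.

```python
from statistics import median, quantiles


def tukey_fences(data, k=1.5):
    """Return (lower, upper) cutoffs: quartiles pushed outward by k * IQR."""
    q1, _, q3 = quantiles(data, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(data, k=1.5):
    """Split data into (kept, outliers) using the fences above."""
    lo, hi = tukey_fences(data, k)
    kept = [x for x in data if lo <= x <= hi]
    outliers = [x for x in data if x < lo or x > hi]
    return kept, outliers


# The 100 samples from the original question.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10,
    11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15,
    14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11,
    161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11,
    11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]

print(tukey_fences(t))                      # (8.5, 12.5)
kept, outliers = split_outliers(t)
print(sorted(outliers))                     # [14, 14, 15, 32, 49, 74, 161241]
print(sorted(split_outliers(t, k=3)[1]))    # [15, 32, 49, 74, 161241]
print(median(kept))                         # 11
```

Because the quartiles here are 10 and 11, the IQR is only 1, so k = 1.5 also flags 14 and 15, which may or may not be wanted. That is one reason to pick the multiple with some model of the data in mind rather than mechanically.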
