Roger

It really depends a lot on the problem at hand.  You usually need to decide 
how to treat the data in terms of some sort of model of the expected 
distribution of the results.
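
On the question of a reasonable multiple: the values conventionally used with 
the IQR rule Brian describes are 1.5 (ordinary outliers) and 3 (far-out 
values), though nothing makes them right for every problem.  Here is a rough 
sketch of how the choice changes what gets flagged in your t.  The function 
names are only illustrative, and the quartiles use the default method of 
Python's statistics.quantiles, which is an assumption; other quartile 
conventions can move the cut-offs a little.

```python
import statistics

# Roger's sample t, copied row by row from the original question.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10, 11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15, 14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11, 161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11, 11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]


def tukey_fences(data, k):
    """Return (lower, upper) cut-offs: the quartiles pushed outward by k * IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(data, k):
    """Return, in sorted order, the values lying outside the cut-offs."""
    low, high = tukey_fences(data, k)
    return sorted(x for x in data if x < low or x > high)


for k in (1.5, 3.0):
    print(k, tukey_fences(t, k), flag_outliers(t, k))
```

Because the quartiles here are 10 and 11, the IQR is only 1, so even the 
multiple of 3 flags 15 as an outlier.  Whether that is sensible is exactly 
the sort of thing that depends on what the data is.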

You can calculate a median from sample data, and then you need to assess how 
different its value might have been with other samples from the same problem 
space.  Getting rid of outliers is similar to computing a confidence interval 
that truncates long tails.
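
One common way to get a feel for how much a median might have moved with a 
different sample is the bootstrap: resample the data with replacement many 
times and look at the spread of the recomputed statistic.  This is only a 
sketch of that idea, not a recommendation for your particular test; the 
function name, the 2000 resamples, the 95% level and the fixed seed are all 
arbitrary choices of mine.

```python
import random
import statistics

# Roger's sample t, copied row by row from the original question.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10, 11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15, 14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11, 161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11, 11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]


def bootstrap_interval(data, stat, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat(data); all parameters are illustrative."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    # Recompute the statistic on each resample drawn with replacement.
    values = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    low = values[int(tail * n_resamples)]
    high = values[int((1.0 - tail) * n_resamples) - 1]
    return low, high


median_lo, median_hi = bootstrap_interval(t, statistics.median)
mean_lo, mean_hi = bootstrap_interval(t, statistics.mean)
print("median interval:", median_lo, median_hi)
print("mean interval:  ", mean_lo, mean_hi)
```

The median barely moves from resample to resample, while the mean swings 
wildly depending on how many times 161241 happens to be drawn.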

Measurements are never distributed in magic bell-shaped (normal, Gaussian) 
curves, or exactly in any other textbook distribution.

One way to think about outliers is to consider that the data comes from a 
mixture of more than one distribution - there is contamination.

Standard deviation is one of the usual things to estimate, but first you need 
some criteria to decide whether the data points in your sample are 
significant or not.  That can only come from knowing what the data is and, if 
it is not complete population data, how large the sample is and how much 
variation is expected from the sampling process.
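
To show how much a single point can dominate the standard deviation, here is 
a small comparison on your t.  The IQR-based figure assumes roughly normal 
data in the middle of the distribution (for a normal distribution the IQR is 
about 1.349 standard deviations); that is an assumption of the sketch, not 
something the data guarantees.

```python
import statistics

# Roger's sample t, copied row by row from the original question.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10, 11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15, 14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11, 161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11, 11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]

# Ordinary sample standard deviation of everything, 161241 included.
sd_all = statistics.stdev(t)

# The same statistic after dropping only the single largest value.
sd_without_max = statistics.stdev(sorted(t)[:-1])

# A scale estimate built from the quartiles, which the tails cannot reach.
q1, _, q3 = statistics.quantiles(t, n=4)
sd_from_iqr = (q3 - q1) / 1.349  # normal-theory conversion, an assumption

print(sd_all, sd_without_max, sd_from_iqr)
```

The three numbers differ by orders of magnitude, which is why the decision 
about which points count has to come before the estimate.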

You can imagine a case where there is a cluster of values centred around a 
median or mean but with very stretched tails, particularly on one side, by 
considering response times.  If your data were a list of response times, the 
value of 161241 is someone asleep at the switch, and other values might 
involve distraction.  Or, if it is machine response time, there might be 
something important in the distribution of the outliers.

So, you might have a series of observations whose long straggling tails 
impart so much variance that the sample mean is less precise than the median.  
For a nice normal distribution, by contrast, the median gives only 2/Pi 
(about 64%) of the information about location that the mean does.
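
The 2/Pi figure is the large-sample efficiency of the sample median relative 
to the sample mean when the data really are normal.  A quick simulation can 
check it approximately; the sample size, number of trials and seed below are 
arbitrary choices, and with finite samples the ratio only comes out near 
2/Pi, not exactly on it.

```python
import math
import random
import statistics


def relative_efficiency(n=101, trials=4000, seed=0):
    """Estimate var(sample mean) / var(sample median) for standard normal samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(statistics.median(sample))
    # A ratio below 1 means the median is the noisier estimate of location.
    return statistics.variance(means) / statistics.variance(medians)


ratio = relative_efficiency()
print("simulated ratio:", ratio)
print("2/Pi:           ", 2 / math.pi)
```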

One strategy is to do "running" calculations that include some number of 
adjacent values, expanding outward until you get a stable value for the 
median.  Use any odd number, but sometimes begin with three numbers at the 
midpoint of the list; for example, if there are 20 values, include first the 
9th, 10th and 11th numbers in the ordered list.  Then you use strategies for 
smoothing.
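
A related technique, and not necessarily the exact procedure described above, 
is the running median used for smoothing: slide an odd-width window along the 
observations in the order they were taken and replace each window by its 
median.  The sketch below assumes that reading; the function name and the 
widths 3 and 7 are only illustrative.

```python
import statistics

# Roger's sample t, copied row by row from the original question.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10, 11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15, 14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11, 161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11, 11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]


def running_median(data, width):
    """Median of each full window of `width` consecutive values (width must be odd)."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    # Only full windows are used, so the result is shorter than the input.
    return [
        statistics.median(data[i:i + width])
        for i in range(len(data) - width + 1)
    ]


for width in (3, 7):
    smoothed = running_median(t, width)
    print(width, min(smoothed), max(smoothed))
```

On this data a width of 3 removes the isolated 74 but still lets 49 through, 
because 161241, 49 and 32 sit next to each other; a width of 7 brings every 
smoothed value down to 12 or below.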

I think you might find examples of this online or in texts using some of these 
words as search terms:

trends
running medians
smoothed plots
long tails
etc

You might know the curve you are looking for - the example you give would be a 
flat line parallel to the horizontal axis.

If it is actually continuous data, there might be an underlying function, and 
you could do some regression analysis.




Donna 
[email protected]


On 2012-01-09, at 11:41 PM, Roger Hui wrote:

> Thanks.  What's a reasonable multiple to use?
> 
> 
> 
> On Mon, Jan 9, 2012 at 7:41 PM, Brian Schott <[email protected]> wrote:
> 
>> John Tukey studied outliers extensively in his exploratory data
>> analysis. He computes a box plot by measuring the IQR, that is, the
>> interquartile range, of the data set. He subtracts a multiple of the
>> IQR from the lower quartile and adds it to the upper quartile of the
>> box in the boxplot. Data values outside these "fences" (in Tukey
>> speak) are outliers.
>> 
>> The code below is from Donald R. McNeil's IDA, A Practical Primer.
>> 
>> http://www.pixentral.com/show.php?picture=1Fnz2FOWX9nuYzndC9GbDbi2z1yz50
>> 
>> 
>> ---
>> (B=)
>> 
>> On Jan 9, 2012, at 7:49 PM, Roger Hui <[email protected]> wrote:
>> 
>>> I wonder if there are well-known techniques in statistics for dealing
>>> with the following problem.
>>> 
>>>     t
>>> 11 10 10 10 10 11 10 10 10 10 9 11 10 11 10 10 11 10 11 10 11 10 10
>>>     11 10 11 10 10 10 11 10 74 11 11 14 11 11 10 12 11 15 14 12 11
>>>     11 11 11 11 10 12 11 11 11 10 11 11 11 10 11 11 10 11 161241 49
>>>     32 12 11 11 12 10 11 10 12 11 12 11 11 12 11 11 12 11 11 11 12
>>>     11 11 12 11 11 11 11 11 11 11 10 11 11 12 12
>>> 
>>> t is a set of samples from a noisy source which is supposed to give the
>>> same integer answer.  Obviously, 161241 is an "outlier", and it is likely
>>> that 74, 49, or even 32 are outliers too.  Are there standard techniques
>>> for discarding outliers to clean up the data, before the application of
>>> statistical tests such as the means test or large sample test?
>> 
>> 
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
> 

