On the reliability of data:

Question to a Mathematician:
What is two plus two?
He responds:
Four.

To an Engineer:
What is two plus two?
He responds:
Three point nine nine (pause with a lot of thought) nine.

To a Geologist:
What is two plus two?
He responds:
Somewhere between three and five.

To a Geophysicist:
What is two plus two?
He responds whispering in your ear:
What would you like it to be?


On Wed, Jan 11, 2012 at 5:09 PM, Donna Y <[email protected]> wrote:

> Roger
>
> It really depends a lot on the problem at hand.  You need to decide the
> treatment of the data usually in terms of some sort of model of the
> expected distribution of the results.
>
> You can calculate a median from sample data and then you need to assess
> how different its value might have been with other samples from the same
> problem space - getting rid of outliers is similar to computing a
> confidence interval that truncates long tails.
>
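
One way to make "how different its value might have been with other samples" concrete is a bootstrap: resample the data with replacement many times and look at the spread of the resampled medians. A minimal Python sketch, using Roger's t from further down the thread. The choice of the bootstrap, the helper name bootstrap_medians, the 1000 resamples and the fixed seed are my assumptions, not something Donna specified:

```python
import random
import statistics

# Roger's samples, copied from the original question quoted below.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10, 11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15, 14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11, 161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11, 11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]

def bootstrap_medians(values, n_resamples=1000, seed=0):
    """Sorted medians of resamples drawn with replacement (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    size = len(values)
    return sorted(
        statistics.median(rng.choices(values, k=size))
        for _ in range(n_resamples)
    )

medians = bootstrap_medians(t)
# Rough 95% percentile interval: drop 2.5% of the resampled medians at each end.
low = medians[int(0.025 * len(medians))]
high = medians[int(0.975 * len(medians)) - 1]
print(statistics.median(t), low, high)
```

For this data the interval should be very tight around 11, because about half of the samples are exactly 11. The same resampling applied to the mean would swing wildly depending on whether 161241 happens to be drawn.
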
> Real measurements are never distributed exactly in magic bell-shaped
> (normal, Gaussian) curves or in any other textbook distribution.
>
> One way to think about outliers is to consider that the data comes from a
> mixture of more than one distribution - there is contamination.
>
> Standard deviation is one of the usual things to estimate but first you
> need some criteria to decide if the data points in your sample are
> significant or not significant - that can only come from knowing what the
> data is - and if it is not complete population data - how large is the
> sample and how much variation is expected from the sampling process.
>
> You can imagine a case where there might be a cluster of values centred
> around a median or mean but with very stretched tails particularly on one
> side by considering response times.  If your data was a list of response
> times, the value of 161241 is someone asleep at the switch, other values
> might involve distraction - or if it is machine response time - there might
> be something important in the distribution of the outliers.
>
> So, you might have a series of observations but the long straggling tails
> impart so much variance that the sample mean is less precise than the
> median - whereas for a nice normal distribution the mean is the better
> estimate and the median gives only 2/Pi of the information about
> location.
>
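
To see how much one straggler costs the mean, here is a small Python check on Roger's t (quoted further down). It uses only the standard statistics module; the variable names are mine:

```python
import statistics

# Roger's samples, copied from the original question quoted below.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10, 11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15, 14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11, 161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11, 11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]

mean_all = statistics.mean(t)
median_all = statistics.median(t)

# Drop only the single most extreme value and recompute both estimates.
trimmed = sorted(t)[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(mean_all, median_all)
print(mean_trimmed, median_trimmed)
```

The mean moves from about 1624 to about 12 when that one value is removed, while the median stays at 11 either way.
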
> One strategy is to do "running" calculations including some number of
> adjacent values and expanding outward until you get a stable value for the
> median - use any odd number, but often you begin with three numbers at the
> midpoint of the list - for example, if there are 20 values, include first
> the 9th, 10th and 11th numbers in the ordered list.  Then you use strategies
> for smoothing.
>
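
A rough Python sketch of the "running" idea follows. It is one reading of it (Tukey-style running medians over the observations in their original order, with an odd window width), not necessarily the exact procedure Donna has in mind. The function name running_median and the way the window is shortened at the two ends are my assumptions:

```python
import statistics

# Roger's samples, copied from the original question quoted below.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10, 11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15, 14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11, 161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11, 11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]

def running_median(values, width=3):
    """Median of each value together with its neighbours (hypothetical helper).

    width must be odd; near the ends the window is simply cut short.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(statistics.median(values[lo:hi]))
    return smoothed

# Widen the window and watch how far the smoothed values still stray.
for width in (3, 5, 7):
    print(width, max(running_median(t, width)))
```

A width of 3 is not enough here, because 161241, 49 and 32 sit next to each other and the median of three neighbours can still land on 49. A wider window absorbs that burst.
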
> You might know the curve you are looking for - the example you give would
> be a flat line parallel to the horizontal axis.
>
> If it is actually continuous data there might be an underlying function,
> and you could do some regression analysis.
>
> I think you might find examples of this online or in texts using some of
> these words as search terms:
>
> trends
> running medians
> smoothed plots
> long tails
> etc
>
>
>
>
> Donna
> [email protected]
>
>
> On 2012-01-09, at 11:41 PM, Roger Hui wrote:
>
> > Thanks.  What's a reasonable multiple to use?
> >
> >
> >
> > On Mon, Jan 9, 2012 at 7:41 PM, Brian Schott <[email protected]> wrote:
> >
> >> John Tukey studied outliers extensively in his exploratory data
> >> analysis. He draws a box plot from the quartiles of the data set and
> >> measures the IQR, that's the interquartile range. He adds a multiple of
> >> the IQR to the upper quartile and subtracts it from the lower quartile
> >> of the box in the boxplot. Data values outside these "fences" (in Tukey
> >> speak; the "hinges" are the quartiles themselves) are outliers.
> >>
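
The IQR rule Brian describes can be sketched in a few lines of Python. This is not McNeil's code (that is only in the linked picture); it is my own minimal version. The multiple k=1.5 is the conventional default for Tukey's fences, with 3.0 often used to mark "far out" values. The function name tukey_fences is hypothetical, and statistics.quantiles (Python 3.8+) uses one of several quartile conventions, so the fences may differ slightly from ones computed with Tukey's hinges:

```python
import statistics

# Roger's samples, copied from the original question quoted below.
t = [
    11, 10, 10, 10, 10, 11, 10, 10, 10, 10, 9, 11, 10, 11, 10, 10, 11, 10, 11, 10, 11, 10, 10,
    11, 10, 11, 10, 10, 10, 11, 10, 74, 11, 11, 14, 11, 11, 10, 12, 11, 15, 14, 12, 11,
    11, 11, 11, 11, 10, 12, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 10, 11, 161241, 49,
    32, 12, 11, 11, 12, 10, 11, 10, 12, 11, 12, 11, 11, 12, 11, 11, 12, 11, 11, 11, 12,
    11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 12, 12,
]

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: quartiles pushed out by k times the IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = tukey_fences(t)
outliers = [x for x in t if x < low or x > high]
kept = [x for x in t if low <= x <= high]

print(low, high)
print(sorted(outliers))
print(len(kept), statistics.mean(kept))
```

Because the data is so tightly packed (the IQR is only 1), k=1.5 also flags the 14s and the 15 along with 32, 49, 74 and 161241. A larger k flags fewer points, so the multiple is a judgment call that depends on how much of the tail you trust.
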
> >> The code below is from Donald R. McNeil's IDA, A Practical Primer.
> >>
> >>
> >> http://www.pixentral.com/show.php?picture=1Fnz2FOWX9nuYzndC9GbDbi2z1yz50
> >>
> >>
> >> ---
> >> (B=)
> >>
> >> On Jan 9, 2012, at 7:49 PM, Roger Hui <[email protected]> wrote:
> >>
> >>> I wonder if there are well-known techniques in statistics for dealing
> >>> with the following problem.
> >>>
> >>>     t
> >>> 11 10 10 10 10 11 10 10 10 10 9 11 10 11 10 10 11 10 11 10 11 10 10
> >>>     11 10 11 10 10 10 11 10 74 11 11 14 11 11 10 12 11 15 14 12 11
> >>>     11 11 11 11 10 12 11 11 11 10 11 11 11 10 11 11 10 11 161241 49
> >>>     32 12 11 11 12 10 11 10 12 11 12 11 11 12 11 11 12 11 11 11 12
> >>>     11 11 12 11 11 11 11 11 11 11 10 11 11 12 12
> >>>
> >>> t is a set of samples from a noisy source which is supposed to give the
> >>> same integer answer.  Obviously, 161241 is an "outlier", and it is
> >>> likely that 74, 49, or even 32 are outliers too.  Are there standard
> >>> techniques for discarding outliers to clean up the data, before the
> >>> application of statistical tests such as the means test or large sample
> >>> test?
> >>
> >>
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
>
>