Alberto Monteiro wrote:
>
> If I have a large, but truncated,
> sample of something,
> how can I estimate what is the probability that this
> something is bigger than some number?
>
What do you mean by 'truncated'?
If you have a random sample that is randomly ordered and you lose any
part of it, then in most cases you just get a smaller random sample.
(The exceptions obviously occur when data points are deleted by a
biased function.)
If the data is ordered and the deleting function is random and unbiased,
then you again have a smaller random sample.
Even if the deletion were periodic (with a linear or specifiable
period), the data could be treated *as if* it were random (since it
is) and unbiased (even though this cannot be known). Just exercise
caution with auto-correlation, cross-correlation and time-series
analysis.
If you lost intervening middle sections from ordered data, then most
aggregate statistics and time series will work GIVEN the additional (and
generally valid) assumption of continuity -- analogous to that used in
common calculus. This family of problems is *MUCH* easier if the
number of lost data points in each span is known or can be closely
approximated by strong, nearly deterministic inference.
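As a sketch of what the continuity assumption buys you: with a known gap
size, the simplest repair is linear interpolation across the missing span.
(The values here are invented purely for illustration.)

```python
def fill_gap(before, after, n_missing):
    """Linearly interpolate n_missing values between the last point
    before a gap and the first point after it (continuity assumption)."""
    step = (after - before) / (n_missing + 1)
    return [before + step * (i + 1) for i in range(n_missing)]

# Two points lost between observed values 10.0 and 16.0
print(fill_gap(10.0, 16.0, 2))  # → [12.0, 14.0]
```

Of course this only stands in for the aggregate statistics; it does not
recover the noise the missing points actually carried.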
If you lose one end or the other of the data (as seems to be the case
here), then aggregate analysis becomes difficult, but time-series
analysis is merely affected by the smaller sample set.
Standard aggregate statistics will remain applicable without
modification *ONLY* to data drawn from the sampled range. That is,
statistically forecast probabilities can lie only in the sampled
sub-population.
In addition, if the number of missing points is known (or subject to
strong inference), then the probability that a given datum (specified or
not) belongs to either sub-population is a straightforward problem in
confidence intervals, suitable for an introductory text. If the size of
the missing part is not known, then this family of problems requires
estimating it.
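To make the "straightforward confidence interval" case concrete: if m of
N total points are known to have been cut off, the fraction m/N estimates
the probability of landing past the cutoff, and a textbook
normal-approximation (Wald) binomial interval bounds it. The counts below
are invented for illustration.

```python
import math

def tail_probability_ci(n_total, n_missing, z=1.96):
    """Estimate P(X > cutoff) as the fraction of points lost past the
    cutoff, with a normal-approximation (Wald) 95% binomial interval."""
    p = n_missing / n_total
    se = math.sqrt(p * (1.0 - p) / n_total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Example: 1000 points drawn, 80 known to lie beyond the cutoff
p, lo, hi = tail_probability_ci(1000, 80)
print(p, lo, hi)  # point estimate 0.08 with its interval
```

For small m or N an exact (Clopper-Pearson) interval would be safer than
the Wald approximation, but the idea is the same.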
Estimating the missing part of the distribution will require that the
researcher make assumptions, and will further introduce statistical error
(which will multiply any errors introduced in statistical reasoning
against the estimated distribution -- in effect, error will be squared).
Any estimation process will use many degrees of freedom. Note that any
forecast will be more subject to error as it moves away from the
surviving sample. This means that concave or closed distribution models
are more easily estimated, as are models that assume the missing
sample data is a relatively small proportion of the whole. (Beware! A
common interpretive error is failing to adjust for this phenomenon
when evaluating alternative models.) Also, stronger distribution
forecasts will be possible if the researcher knows how many points are
missing.
> In numbers: from a random sample { x0[i], i = 1 to n }
> I observe the sample x[i] = min(x0[i], Xcut), what is
> the probability that x0 is greater than K?
>
> Alberto Monteiro
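In Alberto's formulation, x[i] = min(x0[i], Xcut) is right-censored data,
and the usual route is maximum likelihood under an assumed parametric
family. Here is a sketch *assuming x0 is normal* (an assumption, not
something given in the question): uncensored points contribute the normal
density, points stuck at Xcut contribute the tail probability, and the
fitted parameters then give P(x0 > K). The crude grid search stands in
for a real optimizer, and all numbers are invented for illustration.

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_loglik(xs, xcut, mu, sigma):
    """Log-likelihood of N(mu, sigma) data right-censored at xcut."""
    ll = 0.0
    for x in xs:
        if x < xcut:  # fully observed point: normal log-density
            z = (x - mu) / sigma
            ll += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        else:  # censored at xcut: we only know x0 >= xcut
            tail = 1.0 - norm_cdf((xcut - mu) / sigma)
            ll += math.log(max(tail, 1e-300))
    return ll

# Simulate: true distribution N(10, 2), observations cut off at Xcut = 12
random.seed(0)
XCUT = 12.0
x0 = [random.gauss(10.0, 2.0) for _ in range(500)]
xs = [min(v, XCUT) for v in x0]

# Crude grid-search MLE (a real analysis would use a proper optimizer)
best_ll, mu_hat, sg_hat = float("-inf"), None, None
for i in range(81):          # mu in 8.0 .. 12.0
    mu = 8.0 + 0.05 * i
    for j in range(41):      # sigma in 1.0 .. 3.0
        sg = 1.0 + 0.05 * j
        ll = censored_loglik(xs, XCUT, mu, sg)
        if ll > best_ll:
            best_ll, mu_hat, sg_hat = ll, mu, sg

# The question: P(x0 > K), here for K beyond the cutoff
K = 13.0
p_hat = 1.0 - norm_cdf((K - mu_hat) / sg_hat)
print(mu_hat, sg_hat, p_hat)
```

The key point is that the censored points still carry information (how
*many* of them hit the cutoff), which is exactly the "known number of
missing points" advantage described above. With a different distributional
assumption the same likelihood construction applies, only the density and
tail terms change.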