Dear all,

I’m sorry for being so late with the summary of the replies I got to the
following questions:

- What are the drawbacks of the normal score transformation?
- What are the latest developments for properly handling data sets that have
a log normal distribution?

I have cut and pasted below bits and pieces of the many replies I
received. Many thanks to:

Andrew, Joao Felipe, Isobel Clark, Paulo Justiniano Ribeiro Jr, Warr Benjamin,
Hirotaka Saito, Nelleke Swager, Syed Abdul Rahman Shibli, Raymond J. O'Connor

I also received a two-page reply from Donald Myers. I have put the full
text in the archives of AI-GEOSTATS.
-----------------------------------------------------------------------

A. Comments on skewed data sets

The skewness of a data set can have many different origins and its
interpretation is of course highly subjective. Many assumptions have therefore
to be made. 

Most of geostatistics is "distribution free", i.e., the derivations of the
simple kriging, ordinary kriging and universal kriging equations do not depend
on a distributional assumption (contrary to what is sometimes claimed).
However, if a distributional assumption is to be useful it should be
multivariate rather than just univariate. Essentially none of the
transformations that are used in geostatistics can really preserve or produce
multivariate distributional properties; they are only univariate
transformations. For example, a histogram might appear lognormal and the
log-transformed data might then appear normal, but this does not imply
anything about multivariate lognormality.

When handling skewed data sets, one can 

1) remove the long tail and dismiss it as another population, i.e. work with
the main subset

2) dismiss the long tail as a set of "erroneous" data (this might be difficult
to justify) 

3) use the data "as is" with more robust measures, e.g. the madogram, and do
not work with squared differences, which are quite sensitive to long tails.
The choice of the sill becomes a problem in such a case. In the case of
multivariate lognormality, one can compute the relationship between the
variogram/covariance of the original data and the variogram/covariance of the
transformed data. This relationship is essentially unknown in all other cases
because it requires, again, knowing the multivariate distribution in analytic
form (and being able to carry out certain complicated multiple integrations).
The multivariate transform must be known in analytic form and have a unique
inverse. There are examples in the literature of using power series
approximations for the transformation, but too often the approximation is
reduced to a linear one.

4) use a transformation and work in the transformed domain before
backtransforming (watch out for possible biases, where applicable).

5) use an indicator transform for different thresholds and make the
connectivity of extreme values the priority. This might be difficult to
implement in practice, particularly with sparse datasets, because the number
of "pairs" deteriorates at extreme thresholds, where you would normally want
the best "resolution" anyway (median indicator kriging is a possible
workaround).
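
To illustrate why option 3 favours the madogram, here is a minimal Python
sketch. It assumes a regularly spaced 1-D transect with made-up values; the
function names and the data are hypothetical and not taken from any reply. It
only shows how much more a single long-tail value inflates squared
differences than absolute differences.

```python
def semivariogram(values, lag):
    """Half the mean squared difference between points `lag` steps apart."""
    diffs = [values[i + lag] - values[i] for i in range(len(values) - lag)]
    return 0.5 * sum(d * d for d in diffs) / len(diffs)


def madogram(values, lag):
    """Half the mean absolute difference between points `lag` steps apart."""
    diffs = [values[i + lag] - values[i] for i in range(len(values) - lag)]
    return 0.5 * sum(abs(d) for d in diffs) / len(diffs)


# Hypothetical transect; in `skewed` the seventh value is replaced by an
# extreme (long-tail) observation.
clean = [1.0, 2.0, 1.5, 2.5, 2.0, 3.0, 2.5, 3.5]
skewed = clean[:6] + [50.0] + clean[7:]

for name, data in (("clean", clean), ("skewed", skewed)):
    print(name, semivariogram(data, 1), madogram(data, 1))
```

The two measures are not in the same units, so only their relative change
between the clean and the skewed series is meaningful here.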

From the replies I have received, the last option (the indicator transform)
seems to be the most frequently chosen.
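
For readers unfamiliar with the indicator transform, a minimal Python sketch
follows. The data values, the thresholds and the function name are all
hypothetical. Each value is coded 1 if it does not exceed the threshold and 0
otherwise, so the mean of each coded series is the empirical cumulative
frequency at that threshold.

```python
def indicator_transform(values, thresholds):
    """Return {threshold: list of 0/1 codes}, with 1 where value <= threshold."""
    return {t: [1 if z <= t else 0 for z in values] for t in thresholds}


# Hypothetical positively skewed sample and three illustrative thresholds.
data = [0.2, 0.5, 0.7, 1.1, 1.3, 2.0, 3.5, 9.0, 40.0]
thresholds = [0.7, 1.3, 9.0]

codes = indicator_transform(data, thresholds)
for t in thresholds:
    row = codes[t]
    # Mean of the indicators = proportion of data at or below the threshold.
    print(t, row, sum(row) / len(row))
```

At the highest threshold only one value is coded 0, which is the sparse-pairs
problem mentioned under option 5: very few pairs mix 0 and 1 codes, so the
indicator variogram at extreme thresholds is poorly informed.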



B. Problems with Normal Score Transformation (NST)

The NST is useful for revealing the spatial correlation of highly skewed data
sets. Nevertheless, when a transformation is made prior to the estimation,
several problems remain. First, one has introduced an element of ranking in
place of the interval or ratio scale of the original data. Although one treats
the NST data as satisfying the requirements of normality, the back
transformation process can only recover the point estimates (e.g., for
confidence limits) within the resolution afforded by the original data at that
point. If the data are sparsely distributed there, the limit estimate has an
uncertainty reflecting the correspondingly coarse steps (more a measurement
error than an estimation error).

Second, if one has ties in the original data, the NST assigns them to the
corresponding block of contiguous normal scores. Thus extra variance is
introduced as a result of handling the ties.


There are two types of nscore transformation:

1) a frequency based NST: data are transformed in order to get a histogram
showing a normal distribution. 

Drawback: the ordering of the tied values introduces a bias when doing a
back-transformation, especially if there are many zero values.

2) an empirically based NST: the transformation uses the cumulative
distribution and assigns the equivalent in the Gaussian space. When performing
a back-transformation, one gets the original value.

Drawback: the histogram of the transformed data is often not normal.
Nevertheless, the results after kriging and simulation appear to be relatively
robust.
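
The effect of ties can be seen in a small rank-based sketch. This is only an
illustration under stated assumptions, not the implementation of any
particular package: the function name, the `ties` argument, the plotting
position (rank - 0.5) / n and the sample data are all hypothetical choices.
With "ordinal" ranking, tied values are spread over a block of contiguous
normal scores in arbitrary (input) order; with "average" ranking they all
receive the same score.

```python
from statistics import NormalDist


def normal_scores(values, ties="average"):
    """Rank-based normal score transform of a list of numbers.

    ties="average": tied values share the mean rank of their block.
    ties="ordinal": tied values get consecutive ranks in input order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable sort
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Find the end of the block of values tied with values[order[pos]].
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            if ties == "average":
                ranks[order[k]] = (pos + end) / 2 + 1
            else:
                ranks[order[k]] = k + 1
        pos = end + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n keeps the probabilities strictly inside (0, 1).
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical sample with many zeros, as is common for censored data.
sample = [0.0, 0.0, 0.0, 0.0, 1.2, 3.4, 7.8, 20.0]
print(normal_scores(sample, ties="ordinal"))
print(normal_scores(sample, ties="average"))
```

In the ordinal case the four zeros receive four different scores even though
they are the same measurement, which is the extra variance mentioned above.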


C. Performing kriging with log normal data sets

Most of the replies underlined the frequent use of an indicator approach.
Although lognormal kriging seems to be the natural solution for log normal
data sets, it is based on the strict assumption that the data set is log
normal, an assumption which is almost impossible to verify unless one has an
extensive knowledge of the data set.
If one is willing to assume multi-variate lognormality (univariate is not
really sufficient) then the transformation is theoretically known and has a
unique inverse that is also known. Even in this case there
is the problem of a bias in the re-transformed estimates. A number of authors
have written on this, Journel and Dowd being two of them (see various papers
in Math. Geology). As pointed out in those papers, the correction in the case
of Simple Kriging (punctual) is essentially solved, and a good approximation
is available in the case of Ordinary Kriging (punctual). There are some
theoretical problems in the case of block kriging that are usually handled in
an almost ad-hoc way, e.g., if the point values are multi-variate lognormal
then the block values theoretically should not be either univariate or
multivariate lognormal. There seems to be little in the literature pertaining
to a mixing of lognormality and non-constant drift (mean). If the non-constant
mean is not first removed, then the complications resulting from a non-linear
transformation are much worse, since the non-constant mean and the mean zero
random component are not separately transformed.
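
The bias in question can be illustrated with a short Python sketch. It
assumes the commonly quoted simple kriging correction under multivariate
lognormality, Z* = exp(Y* + s2/2), where Y* is the estimate of Y = ln(Z) and
s2 its simple kriging variance. The function name and the synthetic data are
hypothetical. As a stand-in for a kriged value, the sketch uses the global
mean and variance of the logs, which is what simple kriging returns far away
from any data.

```python
import math
import random


def backtransform_sk(log_estimate, sk_variance):
    """Back-transform a simple kriging estimate of ln(Z), adding half the
    kriging variance (valid only under multivariate lognormality)."""
    return math.exp(log_estimate + 0.5 * sk_variance)


# Synthetic illustration: Y ~ N(0, 1), so the true mean of Z is exp(0.5).
rng = random.Random(42)
logs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
mean_log = sum(logs) / len(logs)
var_log = sum((y - mean_log) ** 2 for y in logs) / (len(logs) - 1)

naive = math.exp(mean_log)                       # targets the median of Z
corrected = backtransform_sk(mean_log, var_log)  # targets the mean of Z
true_mean = math.exp(0.5)
print(naive, corrected, true_mean)
```

The naive back-transform lands near the median of Z rather than its mean. The
ordinary kriging case needs a further term and is deliberately left out of
this sketch.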

For other non-linear transforms (other than the log in the case of
multivariate lognormality), even knowing the inverse transform in analytic
form is not sufficient to compute the bias adjustment unless one also knows
the MULTIVARIATE distribution in analytic form. Even then, the mechanics can
be very tedious or complicated: there is a nice theorem on change of variables
in a multiple integral, but applying it to a specific problem is laborious.
Moreover, the theorem has moderately strong assumptions which are not always
satisfied.

In the case of multivariate lognormality, one can also determine the
adjustment needed in the kriging variances. This aspect seems to have
attracted little attention in the case of other non-linear transforms and it
is at least as difficult a problem.


Apparently, lognormal kriging and indicator kriging produce very similar
results.


D. Recent developments:

The literature seems to be quite poor in publications on non-parametric
geostatistics.

The Box-Cox family of transformations, which has the log transformation as a
particular case, has been recently proposed.
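
For reference, a minimal Python sketch of the Box-Cox family in its usual
one-parameter form is given here. The function names are hypothetical, and
the choice of the parameter lam for a given data set is not addressed. The
log transformation is the limiting case lam = 0.

```python
import math


def box_cox(z, lam):
    """Box-Cox transform of a strictly positive value z."""
    if z <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return math.log(z)          # limiting case: the log transformation
    return (z ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Inverse of box_cox for the same lam."""
    if lam == 0:
        return math.exp(y)
    base = lam * y + 1.0
    if base <= 0:
        raise ValueError("value outside the range of the transform")
    return base ** (1.0 / lam)


# lam = 1 only shifts the data, lam = 0.5 is a square-root type transform,
# and lam close to 0 approaches the log.
for lam in (1.0, 0.5, 1e-6, 0.0):
    print(lam, box_cox(5.0, lam))
```

As with the log, the back-transformed estimates are biased unless a
correction suited to the chosen lam is applied.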



SUGGESTED READING

CHRISTENSEN, O.F., DIGGLE, P.J. AND RIBEIRO JR, P.J. (2001). Analysing
positive-valued spatial data: the transformed Gaussian model. In GeoENV III -
Geostatistics for environmental applications, Quantitative Geology and
Geostatistics, Kluwer Series (to appear)

CLARK I. 1996. "Lognormal kriging applied to non-lognormal deposits: two case
studies", 5th International Geostatistics Congress, Wollongong, Australia,
22-27 September

CLARK I. 1997. "Geostatistics applied to skewed data", Conference of the
International Section on Mathematical Methods in Geology (Mining Příbram
Symposia) of the International Association for Mathematical Geology, Prague,
6-10 October, Matematicke Metody V Geologii: Příbram Scientiae Rerum
Montanarum

CLARK I. 1998. "Geostatistical estimation and the lognormal distribution",
Geocongress, Pretoria, RSA, June

SAITO, H. and P. GOOVAERTS. 2000. Geostatistical interpolation of positively
skewed and censored data in a dioxin contaminated site. Environmental Science
& Technology, vol.34, No.19: 4228-4235.



Gregoire Dubois
Institute of Mineralogy and Petrography 
Dept. of Earth Sciences 
University of Lausanne 
Switzerland 

http://www.ai-geostats.org
