Dear all,

I’m sorry for being so late with the summary of the replies I got to the following questions:

- What are the drawbacks of the normal score transformation?
- What are the latest developments for properly handling data sets that have a lognormal distribution?

I have cut and pasted below bits and pieces of the many replies I received. Thanks a lot to: Andrew, Joao Felipe, Isobel Clark, Paulo Justiniano Ribeiro Jr, Warr Benjamin, Hirotaka Saito, Nelleke Swager, Syed Abdul Rahman Shibli, Raymond J. O'Connor.

I also received a two-page reply from Donald Myers. I have put the full text in the archives of AI-GEOSTATS.

-----------------------------------------------------------------------

A. Comments on skewed data sets

The skewness of a data set can have many different origins, and its interpretation is of course highly subjective. Many assumptions therefore have to be made.

Most of geostatistics is "distribution free", i.e., the derivation of the simple kriging, ordinary kriging and universal kriging equations does not depend on a distributional assumption (contrary to what is sometimes claimed). However, if a distributional assumption is to be useful, it should be multivariate rather than just univariate. Essentially none of the transformations used in geostatistics can really preserve or produce multivariate distributional properties; they are only univariate transformations. For example, a histogram might appear lognormal and the histogram of the log-transformed data might then appear normal, but this does not imply anything about multivariate lognormality.

When handling skewed data sets, one can

1) remove the long tail and treat it as another population, i.e. work with the main subset

2) dismiss the long tail as a set of "erroneous" data (this might be difficult to justify)

3) use the data "as is" with more robust measures, e.g. the madogram, and avoid working with squared differences, which are quite sensitive to long tails.
The choice of the sill becomes a problem in such a case. In the case of multivariate lognormality, one can compute the relationship between the variogram/covariance of the original data and the variogram/covariance of the transformed data. This relationship is essentially unknown in all other cases because it again requires knowing the multivariate distribution in analytic form (and being able to carry out certain complicated multiple integrations). The multivariate transform must be known in analytic form and have a unique inverse. There are examples in the literature of using power series approximations for the transformation, but too often the approximation is reduced to a linear one.

4) use a transformation and work in the transformed domain before back-transforming (watch out for possible biases, where applicable)

5) use an indicator transform for different thresholds and put the connectivity of extreme values foremost on your agenda. This might be difficult to implement in practice, particularly with sparse data sets, because the number of "pairs" deteriorates at the extreme thresholds, where you would normally want the best "resolution" anyway (median indicator kriging is a possible workaround).

From the replies I have received, the last seems to be the most frequently chosen option.

B. Problems with Normal Score Transformation (NST)

NSTs are useful for revealing the spatial correlation of highly skewed data sets. Nevertheless, when a transformation is made prior to the estimation, several problems remain.

First, one has introduced an element of ranking in place of the interval or ratio scale of the original data. Although one uses the NST data as satisfying the requirements of normality, the back-transformation process can only recover the point estimates (e.g., for confidence limits) within the resolution afforded by the original data at that point.
If you have sparsely distributed data there, the limit estimate has an uncertainty reflecting the correspondingly coarse steps (more a measurement error than an estimation error).

Second, if one has ties in the original data, the NST assigns them to the corresponding block of contiguous normal scores. Thus extra variance is introduced as a result of handling the ties.

There are two types of normal score transformation:

1) a frequency-based NST: the data are transformed so as to obtain a histogram showing a normal distribution. Drawback: the ordering of the tied values introduces a bias in the back-transformation, especially if there are many zero values.

2) an empirically based NST: the transformation uses the cumulative distribution and assigns the equivalent value in Gaussian space. When performing a back-transformation, one gets the original value back. Drawback: the histogram of the transformed data is often not normal.

Nevertheless, the results after kriging and simulation appear to be relatively robust.

C. Performing kriging with lognormal data sets

Most of the replies underlined the frequent use of an indicator approach. Although lognormal kriging seems to be the natural solution for lognormal data sets, it rests on the strict assumption that the data set is lognormal, an assumption which is almost impossible to verify unless one has extensive knowledge of the data set.

If one is willing to assume multivariate lognormality (univariate is not really sufficient), then the transformation is theoretically known and has a unique inverse that is also known. Even in this case there is the problem of a bias in the re-transformed estimates. A number of authors have written on this, Journel and Dowd being two of them (see various papers in Math. Geology). As pointed out in those papers, the correction in the case of Simple Kriging (punctual) is essentially solved, and a good approximation is available in the case of Ordinary Kriging (punctual).
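To make the re-transformation bias concrete, here is a minimal sketch of the point (punctual) Simple Kriging case only. It assumes that Y = ln(Z) is multivariate Gaussian with a known mean, so that the conditional distribution of Y at the target location is normal with mean equal to the simple kriging estimate and variance equal to the simple kriging variance. Under that assumption the conditional expectation of Z is exp(estimate + variance / 2). The function names and the numerical values are hypothetical and are not taken from any of the replies.

```python
import math


def naive_backtransform(y_sk):
    """Exponentiate the kriged log value directly.

    Under the lognormal assumption this gives the conditional median
    of Z, which is lower than the conditional mean.
    """
    return math.exp(y_sk)


def lognormal_sk_backtransform(y_sk, var_sk):
    """Back-transform a point simple kriging estimate of Y = ln(Z).

    y_sk   : simple kriging estimate of ln(Z) at the target location
    var_sk : simple kriging variance of ln(Z) at the same location

    Returns exp(y_sk + var_sk / 2), the mean of a lognormal variable
    whose logarithm is N(y_sk, var_sk).
    """
    return math.exp(y_sk + 0.5 * var_sk)


if __name__ == "__main__":
    # Hypothetical kriging output in log space, for illustration only.
    y_sk, var_sk = 1.2, 0.5
    print("naive exp(Y*):      ", naive_backtransform(y_sk))
    print("with variance term: ", lognormal_sk_backtransform(y_sk, var_sk))
```

The gap between the two numbers grows with the kriging variance, so the naive back-transform underestimates most where the data are sparsest. The Ordinary Kriging case involves an additional term and is deliberately not covered by this sketch.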
There are some theoretical problems in the case of block kriging that are usually handled in an almost ad hoc way, e.g., if the point values are multivariate lognormal then the block values theoretically should not be either univariate or multivariate lognormal.

There seems to be little in the literature pertaining to the combination of lognormality and a non-constant drift (mean). If the non-constant mean is not first removed, then the complications resulting from a non-linear transformation are much worse, since the non-constant mean and the mean-zero random component are not separately transformed.

For other non-linear transforms (other than the log in the case of multivariate lognormality), even knowing the inverse transform in analytic form is not sufficient to allow computing the bias adjustment unless one also knows the MULTIVARIATE distribution in analytic form. Even then, the actual mechanics of doing so can be very tedious or complicated. That is, while there is a nice theorem on change of variables in a multiple integral, applying it to a specific problem can be laborious. Moreover, the theorem has moderately strong assumptions which are not always satisfied.

In the case of multivariate lognormality, one can also determine the adjustment needed in the kriging variances. This aspect seems to have attracted little attention in the case of other non-linear transforms, and it is at least as difficult a problem.

Apparently, lognormal kriging and indicator kriging produce very similar results.

D. Recent developments

The literature seems to be quite poor in publications on non-parametric geostatistics. The Box-Cox family of transformations, which has the log transformation as a particular case, has recently been proposed.

SUGGESTED READING

CHRISTENSEN, O.F., DIGGLE, P.J. AND RIBEIRO JR, P.J. (2001). Analysing positive-valued spatial data: the transformed Gaussian model.
In GeoENV III - Geostatistics for environmental applications, Quantitative Geology and Geostatistics, Kluwer Series (to appear)

CLARK I. 1996. "Lognormal kriging applied to non-lognormal deposits: two case studies", 5th International Geostatistics Congress, Wollongong Australia, 22--27 September

CLARK I. 1997. "Geostatistics applied to skewed data", Conference of the International Section on Mathematical Methods in Geology (Mining Příbram Symposia) of the International Association for Mathematical Geology, Prague, 6--10 October, Matematicke Metody V Geologii: Příbram Scientiae Rerum Montanarum

CLARK I. 1998. "Geostatistical estimation and the lognormal distribution", Geocongress, Pretoria RSA, June

SAITO, H. and P. GOOVAERTS. 2000. Geostatistical interpolation of positively skewed and censored data in a dioxin contaminated site. Environmental Science & Technology, vol. 34, No. 19: 4228-4235.

Gregoire Dubois
Institute of Mineralogy and Petrography
Dept. of Earth Sciences
University of Lausanne
Switzerland
http://www.ai-geostats.org