AI-GEOSTATS: Summary: data transformation and variograms

Juliann Aukema Wed, 09 May 2001 13:11:37 -0700
I apologize for the delay in posting a summary to my question on data
transformation. Here it is, better late than never, I hope. Thanks a lot to
all those who responded.

Juliann


The key question about transformations and geostatistics is whether one
needs to re-transform. For example, if one uses a log transform (not logit),
then one usually wants to re-transform the results to the original scale,
whereas in the case of the indicator transform one does not re-transform.

The two difficulties that arise are (1) how the variogram of the original
variable and the variogram of the transformed variable are related, and (2)
in the case of a re-transformation, how to compute the bias.
(1) is probably not a problem if you are not going to re-transform. To
actually compute the relationship, however, one would need to know the
multivariate density function (and even then it may be difficult), and that
function is very unlikely to be known in most geostatistical applications.

Donald Meyers
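
To make difficulty (2) concrete, here is a minimal Python sketch of the
re-transformation bias for a log transform. It assumes the data are exactly
lognormal and ignores spatial correlation altogether; both are assumptions
made for illustration, not something stated in the reply above. The
parameter values (mu, sigma, n, seed) are arbitrary.

```python
import math
import random

def backtransform_demo(mu=1.0, sigma=1.0, n=50_000, seed=42):
    """Compare naive and bias-corrected back-transforms of a log-scale mean."""
    rng = random.Random(seed)
    z = [rng.lognormvariate(mu, sigma) for _ in range(n)]  # original scale
    y = [math.log(v) for v in z]                            # log scale

    mean_z = sum(z) / n
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)

    naive = math.exp(mean_y)                    # estimates the median of z
    corrected = math.exp(mean_y + var_y / 2.0)  # lognormal mean formula
    return mean_z, naive, corrected

mean_z, naive, corrected = backtransform_demo()
print(f"sample mean on original scale : {mean_z:.3f}")
print(f"exp(mean of logs), naive      : {naive:.3f}")
print(f"exp(mean + var/2), corrected  : {corrected:.3f}")
```

The naive value falls well below the sample mean because exponentiating the
mean of the logs targets the median. In a kriging setting the correction
term is different again, so this only illustrates why a bias term is needed.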

It is always better to use untransformed data if you
can.

Every complexity you add to your modelling increases
your chances of things going wrong exponentially.

Prime rule: simpler is always better


What I do every time I get a new set of data is the
following:

(1) calculate semi-variograms and look at histograms.
If semi-variogram nice, model and continue. If not:

(2) take logarithms and repeat. If still not nice:

(3) try indicators (lots of) to see if you have mixed
distributions or something similar. If still not nice:

(4) try a rank order (uniform) transform. If you still
don't get nice semi-variograms, there is something
BADLY WRONG with your data. Re-assess your basic
assumptions:

(a) precise reproducible data?
(b) accurate representative data?
(c) homogeneous sampling zones (single populations)?
(d) trend?


Isobel Clark
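
Steps (1) to (4) above can be sketched in plain Python as follows. This is
only an illustration: the function names, the lag settings, and the toy data
are hypothetical, the semi-variogram is the classical half mean squared
difference per lag bin, and the indicator uses a single cutoff (the median)
where the advice above is to try many. The toy values are spatially random,
so all four semi-variograms should come out roughly flat.

```python
import math
import random

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Classical estimator: half the mean squared difference per lag bin."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)            # index of the distance bin
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

def indicator(values, cutoff):
    """Indicator transform: 1 where value <= cutoff, else 0."""
    return [1.0 if v <= cutoff else 0.0 for v in values]

def rank_uniform(values):
    """Rank-order transform to (0, 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0              # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n = len(values)
    return [(r - 0.5) / n for r in ranks]

# Toy data: skewed values at random locations (no spatial structure).
rng = random.Random(0)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(150)]
raw = [rng.lognormvariate(0.0, 1.0) for _ in coords]

candidates = {
    "raw": raw,
    "log": [math.log(v) for v in raw],
    "indicator(median)": indicator(raw, sorted(raw)[len(raw) // 2]),
    "rank": rank_uniform(raw),
}
for name, vals in candidates.items():
    gamma = empirical_semivariogram(coords, vals, lag_width=10.0, n_lags=5)
    print(name, [round(g, 3) for g in gamma if g is not None])
```

With real data one would plot each result and stop at the first transform
that gives an interpretable semi-variogram, as described above.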

Handling correlation on the link scale versus handling it on the unadjusted
scale is apparently "a topic of discussion in statistics."  However, the
following may help: if you handle covariance on the link scale you are
working with a subject-specific model, while a population-averaged model
refers to modeling the covariance in the error term.  I'd recommend getting
a copy of Wolfinger and O'Connell (1993) on generalized linear mixed models.

Fundamentally, your approach may depend on your goals.  Are you really
trying to explain outcomes using predictor variables?  Are you
fundamentally interested in the covariance from an ecological
perspective?  Or are you trying to predict the number of trees per given
area?

If your goal falls into the first two categories and you have a
nonignorable source of nonstationarity, then you can adjust for that
nonstationarity using binary or binomial regression.  If you have covariates
at the tree level, then you might want to use the binary route.  You'll
need to pick a link, and a logit link is a reasonable place to start.
After modeling the mean using logistic regression, you can
assess the spatial structure of the residuals by building semivariograms
from the Pearson or deviance residuals.  If you observe structure, you can
model both the nonstationarity *and* the covariance using
generalized linear mixed models.  If you get this far, you should probably
have read the papers below (or their equivalent).  You can model spatial
variability as either a random effect or as correlated errors.  All
this can be done in SAS using PROC LOGISTIC, PROC VARIOGRAM and the
GLIMMIX macro, respectively.  Brian

Gotway, CA and WW Stroup. 1997. A generalized linear model approach to
spatial data analysis and prediction. JABES 2: 157-178.
Gumpertz, ML, C Wu and JM Pye. 2000. Logistic regression for Southern
Pine Beetle outbreaks with spatial and temporal correlation. Forest Science
46: 95-107.
Wolfinger, R. 1993. Covariance structure selection in general mixed
models. Communications in Statistics - Simulation and Computation 22:
1079-1106.
Wolfinger, R. and M O'Connell. 1993. Generalized Linear Mixed Models: A
Pseudo-Likelihood Approach. Journal of Statistical Computation and
Simulation 48: 233-243.

Brian Gray
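
The first step of the workflow above (model the mean with logistic
regression, then take Pearson residuals) can be sketched in plain Python.
This is a hypothetical stand-in for what PROC LOGISTIC would do, not SAS
code: the function names are made up, the fit is a bare Newton-Raphson with
one covariate, and the elevation and tree counts are invented toy numbers.

```python
import math

def fit_binomial_logit(x, y, n, iters=25):
    """Fit logit(p_i) = b0 + b1 * x_i to y_i infected out of n_i trees."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, y, n):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)         # binomial variance, used as weight
            r = yi - ni * p                # observed minus fitted count
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # 2x2 Newton step solved by hand
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def pearson_residuals(x, y, n, b0, b1):
    """(observed - fitted) / sqrt(binomial variance) for each location."""
    out = []
    for xi, yi, ni in zip(x, y, n):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        out.append((yi - ni * p) / math.sqrt(ni * p * (1.0 - p)))
    return out

# Toy data: centered elevation (km), infected trees, trees sampled.
elev_km = [-1.0, -0.5, 0.0, 0.5, 1.0]
infected = [2, 3, 12, 6, 20]
trees = [20, 15, 30, 10, 25]

b0, b1 = fit_binomial_logit(elev_km, infected, trees)
resid = pearson_residuals(elev_km, infected, trees, b0, b1)
print(f"intercept={b0:.3f} slope={b1:.3f}")
print([round(r, 3) for r in resid])
```

The residuals, paired with the sample coordinates, are what one would then
pass to a semivariogram routine to look for remaining spatial structure.
Centering and rescaling the covariate, as done here, helps the unguarded
Newton iteration behave.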

I think the problem might be even more subtle. Essentially you are looking
at a marked point process, and trying to apply methods designed
principally for data that is continuous throughout the sampling domain.

I would suggest looking at the following paper:
Stoyan and Waelder. 2000. On variograms in point process statistics
II. Models of markings and ecological interpretation. Biometrical Journal
42(2): 171-187.

Another approach you might think about is spatial cdf estimation. Take a
look at the work of Cressie and friends.

Nicholas Lewin-Koh


> >Juliann Aukema wrote:
> >
> >> Hi. I have a question about transforming data.
> >>
> >>         I have infection prevalence data for many points- a proportion of
> >> trees infected. Numbers are between 0 and 1. Sample size varies for the
> >> different points (because density of trees varies). When I plot a variogram
> >> of the prevalence data, I get a nice sill for about 4000 meters and then a
> >> rise in the variogram. If I take the residuals of prevalence against
> >> elevation the second rise goes away. Biologically this all makes sense and
> >> makes a nice story.
> >>          However for some other analyses that I also did with this data, I
> >> was advised to logit transform the prevalence data because it is a
> >> proportion and should be binomially distributed.
> >>         If I plot the variogram of the logit transformed prevalence, the
> >> first sill is much less distinct if it is there at all - this seems to be
> >> mostly due to one point, the last point before the rise, which now goes up
> >> instead of being about even with the previous point. (I guess this
> >> difference is due to the stretching of zero prevalence values that occurs
> >> with the logit transformation.) And if I look at smaller lags, it looks
> >> like a power function with no sill. Biologically, that is harder to
> >> explain. If I plot the residuals of the (logit transformed prevalence)
> >> against (elevation), the variogram has a nice sill and is similar, even
> >> prettier than the analysis of the untransformed data (but based on the
> >> previous variogram, I don't have a very good reason for plotting the
> >> residuals).
> >>         My question, then is whether the logit transformation is necessary
> >> and/or appropriate for the geostatistical analysis. Does it make sense to
> >> use the transformed data for both variograms, for just the residuals
> >> (because the residuals are based on regression for which the transformation
> >> ought to be done) or for neither?
> >>         Thank you very much.
> >>
> >> Juliann
> >> [EMAIL PROTECTED]
> >>
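
The "stretching" near zero prevalence mentioned in the question can be seen
in a few lines of Python. This is a sketch under assumptions: the plain
logit is undefined at exactly 0 or 1, so some adjustment must have been
used, and the 0.5 continuity correction below (the empirical logit) is one
common convention, not necessarily the one applied to these data.

```python
import math

def logit(p):
    """Plain logit; undefined at p = 0 and p = 1."""
    return math.log(p / (1.0 - p))

def empirical_logit(infected, trees):
    """Logit with a 0.5 continuity correction so 0/n and n/n stay finite."""
    return math.log((infected + 0.5) / (trees - infected + 0.5))

# Equal 0.05 steps in prevalence are very unequal on the logit scale.
step_near_zero = logit(0.06) - logit(0.01)
step_mid_range = logit(0.55) - logit(0.50)

# Zero prevalence maps to different values depending on sample size.
zero_of_5 = empirical_logit(0, 5)
zero_of_50 = empirical_logit(0, 50)

print(f"logit step 0.01 -> 0.06 : {step_near_zero:.2f}")
print(f"logit step 0.50 -> 0.55 : {step_mid_range:.2f}")
print(f"empirical logit of 0/5  : {zero_of_5:.2f}")
print(f"empirical logit of 0/50 : {zero_of_50:.2f}")
```

Because squared differences between such stretched values enter the
variogram directly, a few pairs involving zero-prevalence points with
different sample sizes could plausibly move a single lag estimate.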


