The situation gives me little satisfaction.  I am more or less content
with substitutions when the goal is to use exploratory graphical and
multivariate methods.  However, any substitution introduces some
arbitrariness, and logarithms may be sensitive to the substituted value.  A
principled, likelihood-based approach built on a stochastic model could be
impractical for many data analyses and too narrowly specific.

Here are a couple more comments, and a question of my own. 

* log(x + 1) is doubtful for a ratio-scale variable with units, e.g.,
a concentration, because changing the units of x changes the results of
parametric tests.  Substitutions that have the units of x (e.g., the
suggested min(x)/2, with the minimum taken over the nonzero values)
address this invariance issue.

* It is generally good to look for ways to avoid a substitution, e.g., by
using rank-based methods, quantiles, or perhaps censored data methods.

* If a substitution is made, it could be good to check that the resulting 
values are not outliers.  This could be important if the zeros are 
associated with a special population. 
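
To make the units point in the first comment concrete, here is a minimal
sketch in plain Python (standard library only).  The concentrations are
made up, and the helper names log_plus_one, log_half_min and spacings are
mine, not from any package.  It compares the spacing of the transformed
values when the same data are expressed in mg/L and in ug/L.

```python
import math

def log_plus_one(xs):
    # log(x + 1): the added constant 1 has no units.
    return [math.log(x + 1.0) for x in xs]

def log_half_min(xs):
    # Replace zeros with half the smallest nonzero value, then take logs.
    sub = min(x for x in xs if x > 0) / 2.0
    return [math.log(x if x > 0 else sub) for x in xs]

def spacings(ys):
    # Differences from the first value; a pure shift leaves these unchanged.
    return [y - ys[0] for y in ys]

# Hypothetical concentrations, first in mg/L, then the same data in ug/L.
conc_mg = [0.0, 0.2, 1.5, 4.0]
conc_ug = [1000.0 * x for x in conc_mg]

print("log(x + 1), mg/L:", spacings(log_plus_one(conc_mg)))
print("log(x + 1), ug/L:", spacings(log_plus_one(conc_ug)))
print("min/2 rule, mg/L:", spacings(log_half_min(conc_mg)))
print("min/2 rule, ug/L:", spacings(log_half_min(conc_ug)))
```

With the min/2 rule, a change of units only shifts every transformed value
by the same constant, so the spacings agree up to rounding.  With
log(x + 1) the spacings themselves change, which is why results of
parametric tests can depend on the units.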

Here is my question.  Does anyone have any thoughts about the particular
situation where x is a regressor, e.g., the concentration of a pollutant in a
regression to predict a measure of stream ecological quality?  If a poor
choice of substitution is viewed as a model misspecification, I wonder
about doing a special test of fit, e.g., testing the significance of a
zero-one indicator of substitution.
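
A rough sketch of what I have in mind, in plain Python so it runs without
any packages.  The data are invented, the min/2 substitution is just one
possible choice, and solve() and ols() are small helpers written for this
example.  In R the analogous fit would be something like
lm(y ~ log(x_sub) + z), with z the zero-one indicator.

```python
import math

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients and their standard errors."""
    n, p = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    se = []
    for i in range(p):
        # i-th diagonal element of the inverse of X'X
        unit = [1.0 if j == i else 0.0 for j in range(p)]
        se.append(math.sqrt(s2 * solve(xtx, unit)[i]))
    return beta, se

# Hypothetical data: pollutant concentration x and a stream quality score y.
x = [0.0, 0.0, 0.0, 0.2, 0.5, 1.0, 2.0, 4.0, 8.0]
y = [9.1, 8.7, 9.4, 7.9, 7.1, 6.4, 5.8, 5.1, 4.3]

sub = min(v for v in x if v > 0) / 2.0           # substituted value for zeros
z = [1.0 if v == 0 else 0.0 for v in x]          # zero-one indicator
lx = [math.log(v if v > 0 else sub) for v in x]  # log of substituted x

X = [[1.0, a, b] for a, b in zip(lx, z)]         # intercept, log(x), indicator
beta, se = ols(X, y)
t_stat = beta[2] / se[2]
print("indicator coefficient:", beta[2], "t statistic:", t_stat)
```

The t statistic for the indicator would be referred to a t distribution
with n - 3 degrees of freedom (6 here).  A large value would suggest that
the substituted points do not fall on the line implied by the nonzero
data, i.e., that the substitution (or the log-linear form) fits the zeros
poorly.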

David


r-sig-ecology-boun...@r-project.org wrote on 06/24/2009 10:25:25 PM:

> Re: [R-sig-eco] Log transforming zero value data
> From: Nate Upham
> To: Ben Bolker
> Date: 06/25/2009 06:39 AM
> Sent by: r-sig-ecology-boun...@r-project.org
> Cc: r-sig-ecology
>
> Hey Ben,
> In most of my data plots there is a "factor-ceiling"-type distribution,
> which I have been describing as "wedge-shaped" or "triangular."  The
> data look like a right triangle, with the central tendency and variation
> decreasing as a linear function of the habitat variable (X).  This
> pattern seems to be fairly common in ecology (either increasing or
> decreasing), mostly as a result of unmeasured habitat factors influencing
> the response variable.  Either way, Cade and others have been advocating
> the use of quantile regression in ecology for several years, so I thought
> I would give it a try with my data.  In the quantreg package in R, log()
> and exp() let you model the different quantiles with (in my case) a
> log-decay type function.
> 
> Is the idea that, since Y has a binomial distribution, a GLM-based
> approach could tease apart specific impacts of the habitat variable on
> average densities in Y?
> 
> Thanks,
> --Nate
> 
> 
> ---- Original message ----
> >Date: Wed, 24 Jun 2009 21:25:36 -0400
> >From: Ben Bolker <bol...@ufl.edu> 
> >Subject: Re: [R-sig-eco] Log transforming zero value data 
> >To: Nate Upham <nsup...@uchicago.edu>
> >Cc: "r-sig-ecology@r-project.org" <r-sig-ecology@r-project.org>
> >
> >Nate Upham wrote:
> >> Hey there Ben,
> >> I was just checking out your book actually.  When you say that I
> >> should do this as a binomial analysis, is that because this variable
> >> is distributed similarly to a "zero-inflated binomial distribution"?
> >
> >  It's not necessarily zero-inflated -- a moderately high proportion of
> >zeros is a natural property of (non-inflated) binomial distributions
> >with small p and small/moderate N.
> >
> >> 
> >> Since all my data are non-normal, and my comparisons have
> >> heterogeneous variances, I have been using quantile regression to
> >> tease apart the influence of a habitat variable in upper quantiles as
> >> a limiting factor (in the spirit of Cade et al 1999, 2003).  I am log
> >> transforming this response variable for quantile regression, and then
> >> back-transforming it when I plot the lines at different quantiles.
> >>
> >> I do have access to the "denominator" as # of traps set per night,
> >> but is it possible (or desirable) to incorporate the binomial
> >> distribution here when my focus is on quantile regression?
> >
> >  If you're focusing on quantile regression because you're really
> >interested in limiting factors or "ceilings" then I'd say stick with
> >log-transforming (but you should definitely use additive constants that
> >are "small" with respect to the variation in your data, according to
> >one of the recipes specified earlier in this thread -- log(1+x) and
> >log(0.5+x) really don't make sense for your data set).  [I don't really
> >know what assumptions are required for inference in quantile
> >regression, although I'm guessing they're pretty loose -- in particular,
> >no assumption of normality -- but it may implicitly assume continuous
> >distributions?]
> >   On the other hand, if you turned to QR as a way to get around
> >heterogeneous variances etc., and you would really prefer to be able to
> >draw conclusions about the average densities etc., then you might
> >well be able to get farther with a binomial/GLM-based approach.  Do
> >your data look like "factor-ceiling" distributions as in Cade et al?
> >
> >-- 
> >Ben Bolker
> >Associate professor, Biology Dep't, Univ. of Florida
> >bol...@ufl.edu / www.zoology.ufl.edu/bolker
> >GPG key: www.zoology.ufl.edu/bolker/benbolker-publickey.asc
> >
> _________________________________
> Nathan S. Upham
> Ph.D. student
> Committee on Evolutionary Biology
> University of Chicago
> 1025 E. 57th St., Culver 402
> Chicago, IL 60637
> nsup...@uchicago.edu
> 
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
