Carsten Dormann
Wed, 24 Jun 2009 00:47:11 -0700
Dear Nate,

although I learned from Phillippe's response about the existence of log1p, I don't think I will use it (for the reasons below). Thierry's response is true for Poisson data, but not for non-integer values. Still, it points in an important direction: all too often zeros emanate from a different process than the other values (see mixed distributions, zero-inflated models, hurdle models and all that). In that case, you should consult Ben Bolker's excellent book (which is probably still available as a draft on his homepage, but also worth buying).
If you want to transform, here is my take. My folklore guidelines on the c in log(x+c) are:
1. c should be roughly 1/2 of the smallest non-zero value:
   signif(0.5*sort(unique(x))[2], 2)
2. c should be the square of the first quartile divided by the third quartile:
   (quantile(x)[2]^2)/quantile(x)[4]
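The two rules of thumb above can be wrapped in small helper functions. This is only a sketch: the names c_half_min and c_quartile are made up here, and the first helper uses min(x[x > 0]) instead of sort(unique(x))[2] so that it also works when x contains no zeros, which is an assumption about the intent of rule 1.

```r
# Rule 1: half of the smallest non-zero value, rounded to 2 significant digits.
# Hypothetical helper; min(x[x > 0]) gives the same value as sort(unique(x))[2]
# when 0 is present and is the smallest value in x.
c_half_min <- function(x) {
  signif(0.5 * min(x[x > 0]), 2)
}

# Rule 2: first quartile squared, divided by the third quartile.
c_quartile <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  q[1]^2 / q[2]
}

x_demo <- c(0, 0.2, 0.4, 0.6, 0.8)   # toy data, invented for illustration
c1 <- c_half_min(x_demo)             # 0.5 * 0.2
c2 <- c_quartile(x_demo)             # 0.2^2 / 0.6
```

Rule 2 returns 0 whenever the first quartile is itself 0, so it is only useful when zeros are fairly rare in the data.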
For example:
   set.seed(11011)
   x <- c(runif(95), rep(0,5))
Method 1: c=0.0015
Method 2: c=0.015
While this looks like a huge difference (an order of magnitude), it actually isn't all that much, given the range of the data:
   plot(density(x))
   abline(v=c(0.0015, 0.015))
I do have a reference for method 2, but it is in German (Stahel, W. A. (2002) Statistische Datenanalyse. Eine Einführung für Naturwissenschaftler. Vieweg, Braunschweig.). Method 1 is what my PhD statistics adviser recommended. Since he was right in everything else, I rely on his advice here, too. That may, I acknowledge, not be good enough for you. But maybe someone else can find a proper reference.
The key thing for any value of c is that it doesn't distort the analysis. But then, how do you detect distortion? I compared the analysis of rank-transformed data with analyses using various values of c. When c was large (in the current example, say, 0.5 or so), the analysis started to differ from the rank analysis. To use log1p here (i.e. c = 1) would be a dramatic distortion!
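One possible way to set up such a comparison is sketched here. It is not the analysis I actually ran: the covariate z, the grid of constants and the use of the slope's t-value as the summary are all assumptions made purely for illustration.

```r
set.seed(11011)
x <- c(runif(95), rep(0, 5))            # response with 5 zeros, as in the example
z <- x + rnorm(length(x), sd = 0.3)     # hypothetical covariate, invented here

# Helper (made up for this sketch): t-value of the slope in a simple regression.
slope_t <- function(y, z) {
  summary(lm(y ~ z))$coefficients["z", "t value"]
}

# Reference: the same regression on the rank-transformed response.
rank_t <- slope_t(rank(x), z)

# Candidate constants; const = 1 corresponds to log1p(x).
consts <- c(0.0015, 0.015, 0.1, 0.5, 1)
log_t  <- sapply(consts, function(c0) slope_t(log(x + c0), z))

res <- data.frame(const = consts, t_log = log_t, t_rank = rank_t)
print(res)
```

The idea is to look for the range of constants over which the log-based result stays close to the rank-based one, and to be suspicious of constants outside that range.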
Another way to look at it is through the Box-Cox transformation. Since Box-Cox transforms towards a symmetric (not necessarily normal) distribution, c should also be chosen so as to facilitate the transformation towards symmetry.
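A rough way to operationalise the symmetry idea is to scan a grid of constants and keep the one for which log(x + c) has the smallest absolute skewness. This is a sketch under stated assumptions: the skewness helper, the simulated right-skewed data and the grid are all invented for illustration, and it is not a substitute for a proper Box-Cox fit.

```r
# Moment-based sample skewness; 0 for a perfectly symmetric sample.
skewness <- function(y) {
  d <- y - mean(y)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
x <- c(rlnorm(95), rep(0, 5))           # hypothetical right-skewed data with 5 zeros

# Hypothetical grid of candidate constants, log-spaced from 1e-4 to 1.
c_grid <- 10^seq(-4, 0, length.out = 41)
skew   <- sapply(c_grid, function(c0) skewness(log(x + c0)))

# Candidate whose transformed values are closest to symmetric.
c_best <- c_grid[which.min(abs(skew))]
c_best
```

Plotting skew against c_grid (on a log axis) shows how sensitive the symmetry of the transformed data is to the choice of c.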
HTH,
Carsten

Nate Upham wrote:
I have a general stats question for you guys:

How does one normally deal with zero (0) values when log transforming data?

I would like to log transform (natural log, ln) several response variables for use in quantile regression. But one of my variables includes several zero values. Since ln(0) = -infinity, this is not readily possible. Is it best to remove all data with zero values? Or should I add a very small number to each value (e.g., 0.00001)? This seems problematic. Is there an easy way to address this issue?

Thanks much for your help,
--Nate

_________________________________
Nathan S. Upham
Ph.D. student
Committee on Evolutionary Biology
University of Chicago
1025 E. 57th St., Culver 402
Chicago, IL 60637
nsup...@uchicago.edu
_______________________________________________
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
--
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ
Permoserstr. 15
04318 Leipzig
Germany
Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
Email: carsten.dorm...@ufz.de
internet: http://www.ufz.de/index.php?de=4205