Although I learned from Phillippe's response about the existence of
log1p, I don't think I will use it (for reasons below). Thierry's
response is true for Poisson data, but not for non-integer values.
Still, it points in an important direction: all too often zeros
emanate from a different process than the other values (see mixed
distributions, zero-inflated, hurdle and all that). In that case, you
should consult Ben Bolker's excellent book (which is probably still
available as a draft on his homepage, but also worth buying).
If you want to transform, here is my take:
My folklore guidelines on the c in log(x+c) are:
1. c should be roughly 1/2 of the smallest non-zero value.
2. c should be the square of the first quartile divided by the third
   quartile.
x <- c(runif(95), rep(0,5))
Method 1: c=0.0015
Method 2: c=0.015
While this looks like a huge difference (an order of magnitude), it
actually isn't all that much, given the range of the data:
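In R, the two rules might be computed roughly as follows. This is only a sketch: the seed is arbitrary, and taking the minimum and the quartiles over the positive values only is an assumption of this example, not something stated above. Because runif() draws are random, the resulting values will not match the ones quoted for Method 1 and Method 2.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- c(runif(95), rep(0, 5))      # example data from the text: 95 uniform values plus 5 zeros

pos <- x[x > 0]                   # assumption: both rules look at the positive values only

# Method 1: half of the smallest non-zero value
c1 <- min(pos) / 2

# Method 2: square of the first quartile divided by the third quartile
q <- unname(quantile(pos, probs = c(0.25, 0.75)))
c2 <- q[1]^2 / q[2]

print(c(method1 = c1, method2 = c2))
print(range(pos))                 # range of the positive data, for comparison with c1 and c2
```

Setting both constants against range(pos) gives a feel for how small they are relative to the data.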
I do have a reference for method 2, but it is in German (Stahel, W. A.
(2002) Statistische Datenanalyse. Eine Einführung für
Naturwissenschaftler. Vieweg, Braunschweig.).
Method 1 is what my PhD statistics adviser recommended. Since he was
right in everything else, I rely on his advice here, too. That may, I
acknowledge, not be good enough for you. But maybe someone else finds a
The key thing for any value of c is that it doesn't distort the
analysis. But then, how do you detect distortion? I used a comparison of
rank-transformed data and various values of c. When c was large (in the
current example e.g. 0.5 or so), the analysis started to differ from the
rank-analysis. To use log1p here would be a dramatic distortion!
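One way such a comparison could be set up is sketched below. Everything in it is an assumption made for illustration: the simulated two-group data, the choice of a t-test as "the analysis", and the grid of c values (c = 1 corresponds to log1p). It is not the analysis referred to above, only one possible shape for it.

```r
set.seed(1)                                   # arbitrary seed

# Hypothetical data: two groups of skewed, non-negative values with some exact zeros
n <- 50
group <- factor(rep(c("a", "b"), each = n))
x <- c(rlnorm(n, meanlog = -2,   sdlog = 1),
       rlnorm(n, meanlog = -1.5, sdlog = 1))
x[sample(length(x), 10)] <- 0

# Reference analysis on the rank-transformed data
t_rank <- unname(t.test(rank(x) ~ group)$statistic)

# The same analysis on log(x + c) for a range of candidate constants
c_values <- c(0.001, 0.01, 0.1, 0.5, 1)
t_log <- sapply(c_values, function(cc) {
  unname(t.test(log(x + cc) ~ group)$statistic)
})

# Side-by-side view: look for the c at which t_log drifts away from t_rank
comparison <- data.frame(c = c_values, t_log = t_log, t_rank = t_rank)
print(comparison)
```

The table is meant to be read by eye: values of c for which the test statistic stays close to the rank-based one are candidates, while values where it moves away suggest the constant is starting to drive the result.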
Another way to look at it is through the Box-Cox transformation. Since
Box-Cox transforms towards a symmetric (not necessarily normal)
distribution, c should likewise be chosen so as to facilitate the
transformation towards symmetry.
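The symmetry idea can be sketched with a simple grid search. Note that this is not a Box-Cox fit: it swaps in sample skewness as a stand-in criterion and picks the c for which log(x + c) is closest to symmetric. The simulated data, the skewness() helper and the grid of candidate values are all assumptions of this example.

```r
set.seed(1)                                        # arbitrary seed
# Hypothetical data: skewed positive values plus a few exact zeros
x <- c(rlnorm(95, meanlog = -2, sdlog = 1), rep(0, 5))

# Moment-based sample skewness (helper defined here, not from a package)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Candidate constants on a log-spaced grid from 1e-4 to 1
c_grid <- 10^seq(-4, 0, length.out = 41)
skew <- sapply(c_grid, function(cc) skewness(log(x + cc)))

# The constant whose transformed data are closest to symmetric
c_sym <- c_grid[which.min(abs(skew))]
print(c_sym)
```

Plotting skew against c_grid (for instance with plot(c_grid, skew, log = "x")) shows how strongly the choice of c affects the shape of the transformed data.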
Nate Upham wrote:
I have a general stats question for you guys:
How does one normally deal with zero (0) values when log transforming data?
I would like to log transform (natural log, ln) several response variables for
use in quantile regression. But one of my variables includes several zero
values. Since ln(0) = -infinity, this is not readily possible. Is it best to
remove all data with zero values? Or should I add a very small number to each
value (e.g., 0.00001)? This seems problematic. Is there an easy way to address
this
Thanks much for your help,
Nathan S. Upham
Committee on Evolutionary Biology
University of Chicago
1025 E. 57th St., Culver 402
Chicago, IL 60637
R-sig-ecology mailing list
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ
Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939