Dear Nate,

Although I learned about the existence of log1p from Phillippe's response, I don't think I will use it (for the reasons below). Thierry's response holds for Poisson data, but not for non-integer values. Still, it points in an important direction: all too often, zeros emanate from a different process than the other values (see mixed distributions, zero-inflated models, hurdle models and all that). In that case, you should consult Ben Bolker's excellent book (which is probably still available as a draft on his homepage, but is also worth buying).

If you want to transform, here is my take:

My folklore guidelines on the c in log(x+c) are:
1. c should be roughly 1/2 of the smallest non-zero value: signif(0.5*sort(unique(x))[2], 2) (this assumes x contains at least one zero, so that the second-smallest unique value is the smallest non-zero one)
2. c should be the square of the first quartile divided by the third quartile: (quantile(x)[2]^2)/quantile(x)[4]
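The two rules can be wrapped in small helper functions. This is only a sketch: the function names c_half_min and c_quartile and the toy vector x_demo are made up for illustration, and x is assumed to be non-negative.

```r
# Method 1: half of the smallest non-zero value, rounded to 2 significant digits.
# Unlike sort(unique(x))[2], this does not require x to contain a zero.
c_half_min <- function(x) {
  signif(0.5 * min(x[x > 0]), 2)
}

# Method 2: (first quartile)^2 / third quartile,
# using R's default quantile definition (type 7).
c_quartile <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  q[1]^2 / q[2]
}

# Hypothetical toy data, chosen so the results can be checked by hand.
x_demo <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
c_half_min(x_demo)   # 0.1
c_quartile(x_demo)   # 0.25^2 / 0.75, about 0.083
```

Note that method 2 returns c = 0 whenever a quarter or more of the values are zero, so it is of no use in that situation.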
For example:
x <- c(runif(95), rep(0,5))

Method 1: c=0.0015
Method 2: c=0.015
While this looks like a huge difference (an order of magnitude), it actually isn't all that much, given the range of the data:

abline(v=c(0.0015, 0.015))

I do have a reference for method 2, but it is in German (Stahel, W. A. (2002) Statistische Datenanalyse. Eine Einführung für Naturwissenschaftler. Vieweg, Braunschweig.). Method 1 is what my PhD statistics adviser recommended. Since he was right about everything else, I rely on his advice here, too. That may, I acknowledge, not be good enough for you. But maybe someone else can find a proper reference.

The key thing for any value of c is that it doesn't distort the analysis. But then, how do you detect distortion? I compared the analysis of rank-transformed data with analyses using various values of c. When c was large (in the current example, 0.5 or so), the analysis started to differ from the rank analysis. To use log1p here (which amounts to c = 1) would be a dramatic distortion!
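One possible way to set up such a comparison is sketched here. Everything in it is an assumption made for illustration: the predictor z, the simulated response, the use of a simple linear model, and the choice of the slope's t value as the quantity to compare. It is not the analysis I actually ran.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
n <- 100
z <- runif(n)                     # hypothetical predictor
x <- runif(n) * z                 # hypothetical response in [0, 1], related to z
x[sample(n, 5)] <- 0              # set five values to zero, as in the example

# t value of the slope in a simple linear regression of y on z
slope_t <- function(y, z) {
  summary(lm(y ~ z))$coefficients["z", "t value"]
}

c_values <- c(0.0015, 0.015, 0.15, 0.5, 1)   # c = 1 corresponds to log1p
t_rank <- slope_t(rank(x), z)                # reference: rank-transformed response
t_log <- sapply(c_values, function(cc) slope_t(log(x + cc), z))

# One row per c; compare t_log against the common reference t_rank
data.frame(c = c_values, t_log = t_log, t_rank = t_rank)
```

How clearly the results drift away from the rank analysis as c grows will depend on the data and on the model, so the table is best read as a diagnostic rather than a test.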

Another way to look at it is through the Box-Cox transformation. Since Box-Cox transforms towards a symmetric (not necessarily normal) distribution, c should also be chosen in such a way as to facilitate the transformation towards symmetry.
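A rough sketch of the symmetry idea: scan a grid of c values and keep the one for which log(x + c) has sample skewness closest to zero. This is not a Box-Cox fit (the power is fixed at the log case), and the skewness helper, the grid and the right-skewed toy data are all assumptions made for illustration.

```r
# Moment-based sample skewness (0 for a perfectly symmetric sample)
skewness <- function(y) {
  d <- y - mean(y)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)                                    # arbitrary seed
x <- c(rlnorm(95, meanlog = 0, sdlog = 1),     # hypothetical right-skewed data
       rep(0, 5))                              # plus five zeros

c_grid <- 10^seq(-4, 1, length.out = 51)       # candidate values from 1e-4 to 10
skew <- sapply(c_grid, function(cc) skewness(log(x + cc)))

# Candidate whose transformed data are closest to symmetric
c_sym <- c_grid[which.min(abs(skew))]
c_sym
```

For very small c the zeros become extreme low outliers (negative skewness), while for large c the transformation hardly changes the right-skewed shape (positive skewness), so a symmetric compromise lies somewhere in between.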



Nate Upham wrote:
I have a general stats question for you guys:

How does one normally deal with zero (0) values when log transforming data? I would like to log transform (natural log, ln) several response variables for use in quantile regression. But one of my variables includes several zero values. Since ln(0) = infinity, this is not readily possible. Is it best to remove all data with zero values? Or should I add a very small number to each value (e.g., 0.00001)? This seems problematic. Is there an easy way to address this?

Thanks much for your help,

Nathan S. Upham
Ph.D. student
Committee on Evolutionary Biology
University of Chicago
1025 E. 57th St., Culver 402
Chicago, IL 60637

R-sig-ecology mailing list

Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ
Permoserstr. 15
04318 Leipzig

Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
