Dear Philippe,

while I don't like to quibble about rules of thumb (since they are, as you rightly point out, without foundation in statistical theory), I would like to correct the impression your email gave. Let's take a hypothetical example along the lines you proposed (or at least along the lines of what I understand you proposed).
x <- rnorm(100, mean=37, sd=1) # human body temperature
How could we have zeros here?
If we had a zero (from a dead body in the snow, for example), then the value of c should indeed be the one that transforms the values to Kelvin, i.e. 273, not 1! What a strange idea to argue that, because a body lay dead in the snow, we can assume it has the same temperature as the universe! No, sorry, I prefer to stick to the "VERY complicated" rule of thumb of using half the non-zero minimum.
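(To make the Kelvin point concrete, here is a minimal sketch reusing x from above; the dead-body data point is, of course, made up:)

x[1] <- 0          # a hypothetical dead body in the snow, in °C
log(x + 273)       # log on the (roughly) Kelvin scale, i.e. c = 273
log(x + 1)[1]      # c = 1 instead maps the dead body to log(1) = 0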

In my experience, zeros could arise in the following circumstance: we transplant seedlings into some treatments. Some don't survive, others thrive. So, at the end of the experiment, we have the choice of giving the dead seedlings a weight of 0 or NA. If we choose 0, then we are confounding two processes: survival and growth under treatment conditions. That's what I meant when I wrote that the values come from different processes. We could/should opt for a mixed-distribution approach (e.g. what Zuur et al. refer to as ZAP models, in the book that Gavin mentioned). Or we could choose to transform the 0s to match the biomass of the surviving individuals.

Here comes Philippe's criticism of the two rules of thumb: we can use grams, or milligrams, or tons, and each time the c-value would change. Correct: it would. And rightly so, I think. Why should the value of c be some natural constant (such as 1)? Of course we seek to adjust it to the distribution of the data, because we are actually imputing (in a sense) a value we know cannot be right as it is. Therefore it seems obvious to me that we must have rules of thumb that provide different values for different data sets.

In fact, c=1 is not a constant value. Because we log-transform the data, a c=1 added to a value of 1 is large (turning a log(1)=0 into a log(2)=0.7), while the same c=1 added to 28377 is small (turning a log(28377)=10.25333 into a 10.25337). I think this gives an impression of the distortion we add: negligible at the upper end, substantial at the lower. The second rule of thumb tries to balance that.
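(In R, for instance, using the two values from the paragraph above:)

log(1 + 1) - log(1)           # 0.693: a substantial shift at the lower end
log(28377 + 1) - log(28377)   # 0.0000352: negligible at the upper end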

And, really, these rules of thumb are not "VERY complicated":
for c(342, 234, 132, 1441, 2, 4443, 23434, 0), rule 1 proposes c=1 (half the smallest non-zero value, i.e. 2/2) and rule 2 proposes c=4.5.
quantile() yields:
 0%     25%     50%     75%    100%
 0.0    99.5   288.0  2191.5 23434.0
99.5*99.5/2191.5 = 4.5
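In R, both rules are one-liners (this snippet just redoes the arithmetic above):

x <- c(342, 234, 132, 1441, 2, 4443, 23434, 0)
sort(unique(x))[2]/2               # rule 1: half the smallest non-zero value = 1
quantile(x)[2]^2/quantile(x)[4]    # rule 2: 99.5^2/2191.5, approx. 4.5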

Do you think that is VERY complicated?

Anyone volunteering for a little simulation study, analysing which rule-of-thumb would be best? Meet the candidates:
1. Eliminator (get rid of zeros)
2. Oner (log1p)
3. little-bitter (half of the smallest non-zero value)
4. quantiler (first quartile squared, divided by the third quartile)

set.seed(11111)
x1 <- rlnorm(100, 5, 2)          # lognormal sample
x2 <- rlnorm(100, 5.5, 2)        # lognormal sample with truly larger meanlog
t.test(log(x1), log(x2))         # significant, as it should be
x1[sample(1:100, size=40)] <- 0  # now turn 40 random values per sample into zeros
x2[sample(1:100, size=40)] <- 0

1. Eliminator:
t.test(log(x1[x1>0]), log(x2[x2>0])) # significant
2. Oner:
t.test(log1p(x1), log1p(x2)) # not significant
3. little-bitter:
sort(unique(c(x1,x2)))[2]/2 # approx. 0.6
t.test(log(x1+.6), log(x2+.6)) # not significant
4. quantiler:
quantile(c(x1,x2))
# Ah, now that is interesting: with this many zeros the first quartile is 0, so you cannot use this rule of thumb!
quantile(c(x1,x2)[c(x1,x2)>0])
# (34^2)/541 = 2.1
t.test(log(x1+2.1), log(x2+2.1)) # not significant

I wouldn't go so far as to claim that this is a real test of the contestants; it merely outlines a possible approach for doing so. In any case, none of the rules of thumb recovers the "true" significance, only the eliminator does!
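If anyone does volunteer, here is a minimal sketch of how the contest could be run (the function name sim.once, the 40% zeros, and the 1000 replicates are my ad-hoc choices, not a definitive design):

sim.once <- function() {
  x1 <- rlnorm(100, 5, 2)                    # true difference in meanlog:
  x2 <- rlnorm(100, 5.5, 2)                  # the tests *should* be significant
  x1[sample(100, 40)] <- 0
  x2[sample(100, 40)] <- 0
  c.little <- sort(unique(c(x1, x2)))[2]/2   # half the smallest non-zero value
  q <- quantile(c(x1, x2)[c(x1, x2) > 0])
  c.quant <- unname(q[2]^2/q[4])             # first quartile squared / third quartile
  c(eliminator = t.test(log(x1[x1 > 0]), log(x2[x2 > 0]))$p.value,
    oner       = t.test(log1p(x1), log1p(x2))$p.value,
    littlebit  = t.test(log(x1 + c.little), log(x2 + c.little))$p.value,
    quantiler  = t.test(log(x1 + c.quant), log(x2 + c.quant))$p.value)
}
rowMeans(replicate(1000, sim.once()) < 0.05)  # proportion significant per candidate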

I'll be on holiday for the next few weeks, in order to avoid all further discussions ...

Carsten



Philippe Grosjean wrote:
Carsten Dormann wrote:
Dear Nate,

although I learned from Philippe's response about the existence of log1p, I don't think I will use it (for the reasons given below). Thierry's response is true for Poisson data, but not for non-integer values. Still, it points in an important direction: all too often, zeros emanate from a different process than the other values (see mixed distributions, zero-inflated models, hurdle models and all that). In that case, you should consult Ben Bolker's excellent book (which is probably still available as a draft on his homepage, but is also worth buying).

If you want to transform, here is my take:

My folklore guidelines on the c in log(x+c) are:
1. c should be roughly 1/2 of the smallest non-zero value: signif(0.5*sort(unique(x))[2], 2)
2. c should be the square of the first quartile divided by the third quartile: (quantile(x)[2]^2)/quantile(x)[4]
For example:
set.seed(11011)
x <- c(runif(95), rep(0,5))

Method 1: c=0.0015
Method 2: c=0.015
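(Both values come straight from the two one-liners above; the signif() rounding on method 2 is added here just for display:)

signif(0.5*sort(unique(x))[2], 2)             # method 1: 0.0015
signif((quantile(x)[2]^2)/quantile(x)[4], 2)  # method 2: 0.015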
While this looks like a huge difference (an order of magnitude), it actually isn't all that much, given the range of the data:

plot(density(x))
abline(v=c(0.0015, 0.015))

These are VERY complicated rules for just an empirical rule of thumb without connection to any theoretical background. Moreover, c depends on the dataset, and thus changes from dataset to dataset, which is NOT a desirable behaviour.

So, provided the maximum values are large (100s or more) and the minimum value above zero is not too small (let's say 1, e.g., for counts), ln(x+1), alias log1p(), is convenient because log1p(0) = 0. So, given that we use rules of thumb, this one looks good because (1) it transforms zero to zero, and (2) the transformation is independent of the content of the data. But I agree it is not a good choice when you deal with small values.
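(A quick illustration of both properties:)

log1p(0)                 # exactly 0: zeros stay zeros
log1p(1000) - log(1000)  # approx. 0.001: next to nothing for large values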

Now, if you want to be more accurate, you have to determine the actual distribution of your data. If your data are (generalized) log-normally distributed, c must be defined according to what you know about the variable you measure. For instance, temperature expressed in °C would require choosing c = 273.15 to be correct... very far away from 1, or from the c that the rules you propose would yield!

Best,

Philippe

I do have a reference for method 2, but it is in German: Stahel, W. A. (2002) Statistische Datenanalyse. Eine Einführung für Naturwissenschaftler. Vieweg, Braunschweig. Method 1 is what my PhD statistics adviser recommended. Since he was right about everything else, I rely on his advice here, too. That may, I acknowledge, not be good enough for you. But maybe someone else can find a proper reference.

The key thing for any value of c is that it doesn't distort the analysis. But then, how do you detect distortion? I used a comparison of rank-transformed data with various values of c. When c was large (in the current example, e.g., 0.5 or so), the analysis started to differ from the rank analysis. To use log1p here would be a dramatic distortion!
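A minimal sketch of what such a check could look like (the two-group setup and all numbers here are hypothetical, not the original analysis):

g <- rep(c("A", "B"), each = 50)
x <- c(rlnorm(50, 1, 1), rlnorm(50, 1.5, 1))   # made-up skewed data
x[sample(100, 10)] <- 0                        # with some zeros
t.test(rank(x) ~ g)$p.value                    # the rank-based benchmark
for (cc in c(0.005, 0.05, 0.5, 1))             # does log(x+c) tell the same story?
  print(c(c = cc, p = t.test(log(x + cc) ~ g)$p.value))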

Another way to look at it is through the Box-Cox transformation. Since Box-Cox transforms towards a symmetric (not necessarily normal) distribution, c, too, should be chosen in such a way as to facilitate the transformation towards symmetry.
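One way to operationalise that, as a sketch in base R (the quartile-based skewness measure and the search interval are my choices), would be to pick the c that makes log(x+c) most symmetric:

qskew <- function(z) {                  # Bowley's quartile skewness: 0 = symmetric
  q <- quantile(z, c(0.25, 0.5, 0.75))
  unname((q[3] + q[1] - 2*q[2])/(q[3] - q[1]))
}
x <- c(runif(95), rep(0, 5))            # the kind of data from the example above
optimize(function(cc) abs(qskew(log(x + cc))), interval = c(1e-6, 1))$minimum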

HTH,

Carsten


Nate Upham wrote:
I have a general stats question for you guys:

How does one normally deal with zero (0) values when log-transforming data? I would like to log-transform (natural log, ln) several response variables for use in quantile regression. But one of my variables includes several zero values. Since ln(0) is undefined (it goes to minus infinity), this is not readily possible. Is it best to remove all data with zero values? Or should I add a very small number to each value (e.g., 0.00001)? This seems problematic. Is there an easy way to address this issue?

Thanks much for your help,
--Nate

_________________________________
Nathan S. Upham
Ph.D. student
Committee on Evolutionary Biology
University of Chicago
1025 E. 57th St., Culver 402
Chicago, IL 60637
nsup...@uchicago.edu





--
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ Permoserstr. 15
04318 Leipzig
Germany

Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
Email: carsten.dorm...@ufz.de
internet: http://www.ufz.de/index.php?de=4205

